In this study, we analyze how unit cost and other parameters affect the validity of
DOD metric results. Our research included a review of academic literature on forecast
accuracy measurement that uncovered an alternative metric, Mean of Absolute Scaled

We found the DOD metric produced non-intuitive results and was adversely
affected  by  unit  cost  and  demand  volume,  while  MASE  avoided  these  errors.  We  utilized 
MASE  to  compare  six  forecasting  methods  and  found  that  flexibility  in  choice  of 
forecasting  method  produced  better  results  than  the  naive  method  when  coefficient  of 
variation  (CV)  is  below  2.0. 

We recommend that the DOD and Navy adopt MASE for aggregation and item-

I. INTRODUCTION


A. BACKGROUND

In an environment of increasing congressional pressure and decreasing defense
funding,  the  Department  of  Defense  (DOD)  has  been  investing  considerable  effort  and 
resources  into  increasing  the  efficiency  and  effectiveness  of  its  supply  chain 
management.  To  set  a  common  understanding  of  that  issue,  the  next  section  provides  a 
summary  of  the  pivotal  events  that  led  to  the  Comprehensive  Inventory  Management 
Improvement  Plan  (CIMIP),  which  aims  to  reduce  excess  DOD  secondary  inventory. 
“DOD  defines  secondary  items  as  minor  end  items;  replacement,  spare,  and  repair 
components;  personnel  support  and  consumable  items.  Examples  of  secondary  items 
include  aircraft,  tank,  and  ship  components;  construction,  medical,  and  dental  supplies; 
and  food,  clothing,  and  fuel”  (General  Accounting  Office  [GAO],  1988,  p.  1).  Principal 
inventory  items  consist  of  items  such  as  aircraft,  vehicles  and  ships.  DOD  stratifies 
secondary  inventory  into  four  categories:  approved  acquisition  objective,  economic 
retention  stock,  contingency  retention  stock,  and  potential  reutilization  stock.  The 
approved  acquisition  objective  stock  is  calculated  in  order  to  meet  current  requirements, 
while  the  other  three  categories  are  considered  by  GAO  to  be  in  excess  of  current 
requirements  (Government  Accountability  Office  [GAO],  2015b).  While  not  directly 
stated,  the  DOD  appears  to  only  consider  potential  reutilization  stock  as  excess  and  seems 
reluctant  to  dispose  of  economic  and  contingency  retention  stocks  due  to  the  potential 
that they will be needed in the future. Figure 1 shows how much of the Navy’s
secondary inventory was considered excess in fiscal years 2004 through 2007.

1.  Pre  CIMIP 

On  September  8,  1982,  the  U.S.  Congress  enacted  the  Federal  Managers  Financial 
Integrity  Act  (FMFIA).  Primarily  an  amendment  to  the  Accounting  and  Auditing  Act  of 
1950,  it  required  “ongoing  evaluations  and  reports  of  the  adequacy  of  the  systems  of 
internal  accounting  and  administrative  control  of  each  executive  agency”  (Federal 
Managers  Financial  Integrity  Act  of  1982,  2012).  While  implementation  of  the  act  did  not 
immediately solve the issues that it intended to address (GAO, 1989), it became a driving
force  behind  the  ongoing  efforts  to  improve  the  way  that  the  federal  government  manages 
resources. 

In  July  1988,  the  General  Accounting  Office,  as  GAO  was  known  at  the  time, 
published  a  report  in  response  to  Senate  inquiries  regarding  the  growth  of  secondary  item 
inventories  within  the  DOD  (GAO,  1988).  Between  1980  and  1987,  according  to  the 
report,  the  value  of  the  DOD’s  secondary  items  grew  from  $43  billion  to  $94  billion, 
about $19 billion of which was attributable to the Navy. Of this $51 billion
increase, $27 billion was due to the increasing size of the U.S. military, while $19 billion
was  considered  to  be  in  excess  of  requirements  and  $5  billion  was  “unstratified,”  which 
means  that  it  was  not  allocated  to  a  specific  inventory  purpose  such  as  current 
requirement  or  economic  retention.  This  report  contributed  to  the  growing  number  of 
GAO  studies  concluding  that  DOD  needed  to  do  a  better  job  of  managing  its  inventory. 

On  January  23,  1990,  GAO  released  a  letter  from  the  comptroller  general  (CG)  of 
the  United  States  addressed  to  the  chairman  of  the  U.S.  Senate  committee  on 
governmental  affairs  and  the  chairman  of  the  U.S.  House  of  Representatives  committee 
on  government  operations  (GAO,  1990).  In  the  letter,  the  CG  highlights  the  need  to 
improve  the  internal  controls  and  financial  management  systems  of  the  federal 
government. In October 1989, in support of the Office of Management and Budget
(OMB) identification of “high risk” areas, and after reviewing reports submitted under
FMFIA,  the  CG  identified  14  target  areas  that  would  receive  special  attention  from  the 
GAO.  One  of  those  areas  singled  out  for  special  review  was  the  DOD  inventory 
management  systems,  due  to  growing  excess  inventory  levels  now  valued  at  over  $30 
billion,  and  numerous  other  indicators  of  poor  financial  management.  Since  that  time,  the 
GAO  has  considered  the  DOD’s  inventory  management  a  high-risk  area,  and  although 
the  name  of  the  problem  has  changed  to  DOD  supply  chain  management,  it  remains  one 
of  the  32  high-risk  areas  on  the  GAO’s  2015  list  (GAO,  2015a). 

In December 2008, the GAO published a report that evaluated the cost efficiency
of the Navy’s spare parts inventory. In explaining why the Navy had accumulated excess
secondary inventory, the report concluded, “much of the inventory that exceeded current
requirements or had inventory deficits resulted from inaccurate demand forecasts” (GAO,
2008,  p.  34).  The  report  also  documented  the  results  from  surveys  of  the  Navy’s  Item 
Managers  (IM)  who  identified  many  additional  factors  that  they  felt  were  contributing  to 
inventory  excesses  and  deficits  (GAO,  2008).  From  2004  to  2007,  GAO  calculated  that 
secondary  inventory  in  excess  of  current  requirements  averaged  about  40%,  or  $7.5 
billion, of total Navy inventory. Figure 1, from the report, shows this trend in 2007
dollars.  This  report  was  the  second  in  a  series  of  GAO  reports  that  reviewed  the 
secondary  inventory  management  of  the  Air  Force  (GAO,  2007),  Army  (GAO,  2009)  and 
Defense  Logistics  Agency  (DLA)  (GAO,  2010).  To  varying  degrees,  each  of  these 
reports  commented  on  the  need  for  improved  demand  forecasting.  Subsequently,  GAO 
concluded  that  “inaccurate  demand  forecasting  is  the  leading  reason  for  the  accumulation 
of excess inventory” (GAO, 2011, p. 11) throughout the services and DLA.

Figure 1. Navy Secondary Inventory Meeting and Exceeding Requirements (FY 2004-2007). Source: GAO (2008).

[Bar chart not reproduced. Vertical axis: dollars (in billions); horizontal axis: fiscal year; series: beyond current requirements and current requirements.]

After 20 years of effort with little improvement, Congress inserted language into
the fiscal year (FY) 2010 National Defense Authorization Act (NDAA) that required the
development  of  an  extensive  plan  that  would  improve  the  inventory  management 
practices  within  the  DOD.  When  the  NDAA  was  enacted  on  October  28,  2009,  section 
328  required  that  this  plan  be  provided  to  Congress  for  review  within  270  days.  The  plan 
was  required  to  address  eight  separate  elements  intended  to  improve  “the  inventory 
management  systems  of  the  military  departments  and  the  Defense  Logistics  Agency  with 
the  objective  of  reducing  the  acquisition  and  storage  of  secondary  inventory  that  is  excess 
to  requirements”  (NDAA,  2009,  para  [a]).  The  most  relevant  aspect  to  this  research  is  the 
second  part  of  the  first  element,  which  required  the  “development  of  metrics  to  identify 
bias toward over-forecasting and adjust forecasting methods accordingly” (NDAA, 2009,
para [b(1)]). This legal requirement would eventually result in the DOD developing a
common  metric  for  forecast  accuracy  and  forecast  bias  that  would  measure  the 
performance  of  each  military  service  and  DLA. 

2. CIMIP

As required by the FY10 NDAA section 328, the Assistant Secretary of Defense
for  Logistics  and  Materiel  Readiness  published  the  DOD’s  Comprehensive  Inventory 
Management  Improvement  Plan  in  October  2010.  In  addition  to  fulfilling  the  demands  of 
Congress,  the  objective  of  the  plan  was  to  drive  “a  prudent  reduction  in  current  inventory 
excesses  as  well  as  a  reduction  in  the  potential  for  future  excesses  without  degrading 
materiel support to the customer” (Assistant Secretary of Defense for Logistics and
Materiel Readiness (ASD[L&MR]), 2010, p. iii). In that document, chapter one contains
an overview of inventory management improvement, assigns responsibilities and
highlights the implementation strategy. Chapters two through nine detail the eight
sub-plans that have been developed to address the eight elements required by section 328,
while chapter ten details four additional improvement actions that the DOD is developing
on its own initiative. Although these department-wide actions were not specifically
required by section 328, they were included in the plan because “these actions support the
Department’s intent to improve DOD inventory management and reduce excesses”
(ASD[L&MR], 2010, p. 10-1). Appendix A lists 17 other DOD strategies, plans, or
efforts that are consistent with the CIMIP overall objective of reducing secondary item
inventory levels. Most importantly, Appendix A highlights that the plan is consistent with
the objectives of the DOD Logistics Strategic Plan, which “identifies high level goals,
performance measures, and key initiatives that support the DOD priorities and drive the
logistics enterprise improvements” (ASD[L&MR], 2010, p. A-1). Appendix B lists the 12
GAO reports published between March 2006 and May 2010 that are related to secondary
item inventory, summarizes their findings, and briefly states how the plan will address
each finding. Appendix C reprints the entirety of section 328 of the FY10 NDAA, while
Appendix D provides a list of abbreviations.

While the plan is a comprehensive approach to improving materiel management,
only chapter II, Sub-Plan A: Demand Forecasting, is relevant to our research. The overall
objective of sub-plan A “is to improve the prediction of future demands so that inventory
requirements more accurately reflect actual needs” (ASD[L&MR], 2010, p. 2-3). In order
to accomplish this objective, the DOD did a thorough review of current forecasting
procedures and methodologies in search of ways to improve the process. As a result of
this review, the DOD established five action items that required further work to address
the issues with demand forecasting. Of these five action items, Action A-2: Implement
Standard Metrics to Assess Forecasting Accuracy and Bias is the basis for this research
project. DOD targeted the end of fiscal year 2011 to identify these two metrics and the
end of fiscal year 2012 to establish the processes by which the DOD components could
set targets and begin utilizing the common metrics. The accuracy metric intends to
measure forecast performance while minimizing bias and generating results for various
inventory segments. The bias metric intends to identify over- and under-forecasts in order
to prevent inventory excesses and deficits.

3.  Post  CIMIP 

In January 2011, GAO published its required 60-day assessment of the DOD’s
plan to meet the eight elements identified in section 328 (GAO, 2011). While GAO
concluded that the plan did address all eight elements from section 328 of the FY10
NDAA, the report identified five general areas that could produce implementation
challenges if not managed properly. One of the examples the report used to highlight
potential friction areas was the requirement to develop a standard accuracy metric and
performance targets. GAO felt that this level of standardization could be difficult to reach
given the fact that the services and DLA had different approaches to measuring demand
forecast accuracy (GAO, 2011, p. 6). In May 2012, GAO fulfilled its final requirement
from section 328 by publishing its 18-month assessment of the effectiveness with which the
services and DLA had implemented the plan they developed. GAO concluded that
the DOD was “making progress towards... establishing a department-wide set of
standardized metrics for inventory management. Moving forward, DOD’s inventory
management improvement efforts would benefit from challenging, but achievable targets
for reducing its on-order and on-hand excess inventory” (GAO, 2012, p. 30). Within the
demand-forecasting sub-plan, GAO determined that while DOD had successfully
developed the forecast accuracy and bias metrics, the effective implementation of these
metrics still required a sustained effort to meet the expected completion date of
September 2012. The accuracy metric that was developed is an absolute error metric,
while the bias metric is a signed error metric. The formulas for these two metrics are
discussed further in Chapter II and are shown in Equations (2.24) and (2.25).
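
To make the general distinction between an absolute error metric and a signed error metric concrete, the following minimal Python sketch uses invented forecast and demand values. It is not the CIMIP formulation, which is given in Chapter II; the function names, the sample numbers, and the sign convention (forecast minus actual, so that a positive value means over-forecasting) are assumptions made for illustration only.

```python
def signed_errors(forecasts, actuals):
    """Forecast minus actual per item; positive means over-forecast (assumed convention)."""
    return [f - a for f, a in zip(forecasts, actuals)]


def total_absolute_error(forecasts, actuals):
    """Sum of error magnitudes: over- and under-forecasts both add to the total."""
    return sum(abs(e) for e in signed_errors(forecasts, actuals))


def total_signed_error(forecasts, actuals):
    """Sum of signed errors: over- and under-forecasts offset each other."""
    return sum(signed_errors(forecasts, actuals))


# Hypothetical data: one over-forecast, one under-forecast, one exact forecast.
forecasts = [120, 80, 50]
actuals = [100, 100, 50]

abs_err = total_absolute_error(forecasts, actuals)  # 20 + 20 + 0
bias = total_signed_error(forecasts, actuals)       # 20 - 20 + 0
print(abs_err, bias)
```

In this invented case the signed total is zero even though two of the three forecasts missed by 20 units, which is why a signed metric is suited to detecting systematic over- or under-forecasting while an absolute metric is suited to measuring the overall size of the errors.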

Reinforcing CIMIP efforts, the acting Under Secretary of Defense for
Acquisition, Technology, and Logistics (USD[AT&L]) signed DOD Instruction 4140.01
in December 2011, establishing that DOD’s supply chain materiel management “shall
operate as a high-performing and agile supply chain responsive to customer requirements
during peacetime and war while balancing risk and total cost” (Kendall, 2011). In
addition to clearly defining policy and assigning responsibility for management of
materiel across the DOD supply chain, this instruction laid out the framework for 11
DOD Supply Chain Materiel Management Procedures manuals. In February 2014, the 11
manuals were published as volumes 1 through 11 of DOD Manual 4140.01, with each
covering specific supply chain procedures. Volume 2, Demand and Supply Planning,
provided, among other things, guidance on how DOD components should forecast
customer demand. Volume 10, Metrics and Inventory Stratification Reporting, required,
among other things, that the DOD utilize metrics that were specific, measurable,
actionable, realistic, and timely, and it cited demand forecast accuracy as an example
of such a metric.

In  April  2015,  GAO  released  its  most  recent  report  related  to  defense  inventory 
management,  concluding  that  the  services  had  generally  been  able  to  reduce  their  excess 
inventory, which was the primary objective of section 328 of the FY10 NDAA. Although
this result was positive, GAO had seven recommendations to improve how DOD
managed inventory. While GAO recommended that DOD establish goals for these
metrics, DOD wanted to collect more data to establish a performance baseline before
setting any department-wide goals (GAO, 2015b, p. 43). The report also reviewed results
from the first and second metrics reporting periods. These metric results are reported
semi-annually for the preceding 12-month period, so the first period covered all 12
months of FY13 ending in September 2013. The second period covered the last six
months of FY13 and the first six months of FY14 ending in March 2014. Figure 2 and
Figure 3 show the results reported by three services during these two 12-month
reporting periods. The figures do not include the results for DLA or the non-aviation
material  for  the  Marine  Corps.  The  Marine  Corps  aviation  material  is  included  in  the 
Navy  results. 
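
The semi-annual, trailing 12-month reporting schedule described above can be sketched in a few lines of Python. This is only an illustration of the calendar arithmetic; the helper name `months_in_window` is hypothetical, and the only inputs taken from the text are the two period end dates (September 2013 and March 2014).

```python
def months_in_window(end_year, end_month, length=12):
    """Return (year, month) pairs for the `length` months ending at end_year/end_month."""
    months = []
    year, month = end_year, end_month
    for _ in range(length):
        months.append((year, month))
        month -= 1
        if month == 0:  # step back from January to the previous December
            year, month = year - 1, 12
    return list(reversed(months))


# First reporting period: the 12 months of FY13, ending September 2013.
first_period = months_in_window(2013, 9)
# Second reporting period: the 12 months ending March 2014.
second_period = months_in_window(2014, 3)

# Months that appear in both reporting periods.
overlap = sorted(set(first_period) & set(second_period))
print(first_period[0], second_period[0], len(overlap))
```

Because consecutive windows share half of their months, a change between two reported results reflects only the six months that entered and the six months that dropped out of the window.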

Figure 2. Demand Forecast Accuracy Performance by Service. Source: GAO (2015).

The Air Force reported the highest forecast accuracy for these periods. The Army showed
the greatest improvement over the two reporting periods.


Figure 3. Demand Forecast Bias by Service. Source: GAO (2015).

The Army had the largest bias for over-forecasting demand, followed by the Navy and the
Air Force. In the second reporting period, the Air Force reported a negative bias, which
indicates that it was under-forecasting its demand.

In response to the Navy’s relatively poor performance in both forecast accuracy
and  bias,  Naval  Supply  Systems  Command  (NAVSUP)  reported  that  they  were 
“reviewing  and  analyzing  their  demand  forecasting  processes  and  planning  factors  to 
improve  performance  on  DOD’s  forecast  accuracy  and  bias  metrics  tracked  across  the 
department”  (GAO,  2015b,  p.  46). 

B. DATA DESCRIPTION AND RECENT RESULTS

The  business  rules  for  calculating  the  DOD’s  demand  forecasting  accuracy  and 
bias  metrics  that  were  provided  to  DLA  and  each  of  the  services  specify  eight  forecast 
data  elements  and  two  demand  history  data  elements  that  should  be  included  in  their  data 
captures (DOD, 2013). These elements were:

Forecast  Data  Elements 

• NIIN / family head / subgroup master

•  Demand  forecast  (monthly/quarterly/semi-annually) 

•  Latest  acquisition  cost  or  moving  average  cost 

•  Reparable/consumable  indicator 

•  Unit  of  issue 

•  Unit  of  measure 

•  Time  frame  of  the  forecast  (start  date) 

•  Date  the  forecast  was  made  (forecast  date) 

Demand  History  Data  Elements 

•  Actual  demand 

•  Timeframe  of  demand 

NAVSUP  Weapon  Systems  Support  (WSS)  provided  CIMIP  compliant  data  for 
fiscal years 2013, 2014 and 2015. The raw data elements provided the National Item
Identification Number (NIIN), quarterly demand forecast, repair indicator, stock routing
code, replacement cost, acquisition advice code, performance based logistics indicator,
family group code, unit of measure, life cycle indicator (LCI), cognizance code, actual
annual demand, and annual naive forecast. The FY14 and FY15 data also included
additional calculated elements such as annual demand forecast, total dollar calculations,
absolute and signed errors, and line item forecast accuracy and bias metrics. The FY15 data
also included a bar graph of the Navy’s overall CIMIP results reported to DOD for the five
previous 12-month evaluation periods (Figure 4).


Figure 4. Navy CIMIP Forecast Metric Results FY13-FY15. Source: NAVSUP (2015).

Accuracy and bias results are reported to DOD semi-annually for the preceding 12-month
period, which creates a six-month overlap in the data. The accuracy result is an absolute
error metric that summarizes the Navy’s forecasting performance. The bias result is a
signed error metric that represents the degree of over-forecasting.


C.  PURPOSE  AND  BENEFITS  OF  STUDY 

This  research  effort  intends  to  review  the  validity  of  the  DOD’s  newly 
implemented  CIMIP  forecasting  metrics  and  identify  weaknesses  that  may  not  be 
apparent  to  the  casual  observer  of  forecast  accuracy  metrics.  We  also  intend  to  provide 
recommendations  that  will  improve  the  DOD’s  efforts  to  increase  forecast  accuracy, 
which should result in better forecasts in the future, lower levels of excess inventory,
and ultimately substantial cost savings to the DOD. While we certainly appreciate the
complexity of forecasting future demand and accurately measuring those forecasts, and
recognize the amount of effort that has already been devoted to this issue, we will
demonstrate  that  our  research  can  provide  value  to  these  DOD  efforts.  Even  if  the  DOD 
disregards  our  recommendations,  there  are  still  opportunities  for  the  Navy,  or  the  other 
services,  to  implement  our  recommendations  and  improve  their  demand  forecasting 
efforts. 

D. RESEARCH QUESTIONS

In  2011,  GAO  declared  that  “inaccurate  demand  forecasting  is  the  leading  reason 
for  the  accumulation  of  excess  inventory”  (p.  11),  and  as  Figure  5  shows,  the  Navy  has 
been  making  steady  progress  in  reducing  its  on-hand  excess  inventory;  however,  despite 
this  good  news,  the  CIMIP  forecast  results  have  not  significantly  changed  (Figure  4). 


Figure 5. Navy On-Hand Excess Inventory, Sept. 2012 to Mar. 2014. Source: GAO (2015).

The vertical bars represent excess inventory as a percentage of total inventory. The bottom
table shows inventory dollar values in billions. While total inventory value has remained
constant, whether or not contractor-managed inventory is excluded, excess inventory has
been decreasing in real dollar values and as a percentage of total inventory.

Although many factors contribute to excess inventory levels, GAO (2011, p. 11) identified
inaccurate demand forecasting as the “leading reason for the accumulation of excess
inventory.” If forecast accuracy is not improving, or is getting worse, while excess inventory
is declining, then this raises the question of whether demand forecasting is actually the
largest contributor or, alternatively, whether forecasting performance is being accurately
measured. Our intuition is that the answer lies in the second explanation, and we intend to
show it by addressing the following questions:

• Does the CIMIP forecasting metric capture forecast error in a way that is
actionable?

• Are the CIMIP forecasting accuracy results impacted by variables or data
set characteristics that are not directly related to forecast error?

• Does the CIMIP forecasting metric provide a useful product to the
forecasters that enables them to prioritize their forecast improvement
efforts?

• Is there an alternative forecast accuracy equation that both enables the
aggregation of accuracy results for multiple line items with various
units of measure and provides actionable results at the item level?

Finally, it is also important to investigate how forecast accuracy results can generate
valuable information for the Navy’s managers.

E.  SCOPE,  ORGANIZATION  AND  METHODOLOGY 

While inventory management improvement efforts span a large range of topics
detailed in the 2010 CIMIP, this research will focus primarily on one line of effort to
improve demand forecasting: the measurement of forecast accuracy. Chapter II is a
summary of the traditional academic accuracy metrics and a compilation of the most
valuable findings in the existing literature. Chapter III aims to present an in-depth
analysis of the CIMIP equation and a comparison to an alternative accuracy metric, Mean
Absolute Scaled Error (MASE). Those analyses are composed of specific tests to uncover
the existence of inherent flaws or undesirable characteristics in the current metric. In
order to compare the accuracy metrics, we assess them utilizing four desirable
characteristics. The tests we will conduct utilize three different methods, according to
specific purposes. The first method uses fictional numbers, the second uses real numbers
extracted from available data, and the third generates Monte Carlo simulations using the
Crystal Ball program.
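
As a rough illustration of the alternative metric named above, the sketch below computes MASE in its commonly published form: the mean absolute forecast error divided by the mean absolute error of a one-step naive forecast over the historical data. This is an assumption about the general form only; the exact variant analyzed in Chapter III may differ, and the function name and all sample numbers are hypothetical.

```python
def mase(history, actuals, forecasts):
    """Mean Absolute Scaled Error using an in-sample one-step naive benchmark.

    history:   past demand used only to compute the scaling factor
    actuals:   observed demand in the evaluation periods
    forecasts: forecasts for those same evaluation periods
    """
    if len(history) < 2:
        raise ValueError("history needs at least two observations")
    # Mean absolute error of forecasting each period with the previous period's value.
    naive_mae = sum(
        abs(history[i] - history[i - 1]) for i in range(1, len(history))
    ) / (len(history) - 1)
    if naive_mae == 0:
        raise ValueError("naive benchmark error is zero; MASE is undefined")
    abs_errors = [abs(a - f) for a, f in zip(actuals, forecasts)]
    return (sum(abs_errors) / len(abs_errors)) / naive_mae


# Hypothetical item measured in single units.
history = [10, 14, 12, 16]
actuals = [15, 13]
forecasts = [14, 15]
score = mase(history, actuals, forecasts)

# The same item expressed on a scale ten times larger (for example, a different unit of issue).
scaled_score = mase(
    [10 * h for h in history],
    [10 * a for a in actuals],
    [10 * f for f in forecasts],
)
print(round(score, 4), round(scaled_score, 4))
```

The two scores agree because both the numerator and the denominator scale with the data, which is the property that makes a scaled error a candidate for aggregating results across items with different demand volumes or units of measure. A value below 1 indicates that the forecast outperformed the naive benchmark on average.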

Although we did not intend to make this a discussion on forecasting methods, the
interrelatedness of forecasting methods and results measurement makes it unavoidable.
Therefore,  in  Chapter  IV,  we  analyze  alternative  ways  to  generate  more  accurate  demand 
forecasts.  In  Chapter  V,  we  summarize  the  most  important  findings,  make 
recommendations  for  DOD  and  Navy,  and  propose  future  areas  of  research  to  continue  to 
advance the effectiveness of DOD and Navy forecasting efforts.

II. LITERATURE REVIEW


A. INTRODUCTION

This chapter presents a review of the evolutionary path of knowledge in the field
of forecast accuracy, while also providing an overview of the most popular forecast
accuracy measures.

B. FORECAST ACCURACY

Demand forecasts are a key component of effective inventory management.
Delivery of inputs to production takes time and, even considering a deterministic
scenario, managers need to be precise in determining the correct time to transmit their
orders to suppliers, in order to avoid costs from shortages or from holding excess inventory.

Reinforcing that idea, Makridakis and Hibon (2000) claim that “forecasting
accuracy is a critical factor for, among other things, reducing costs and providing better
customer service” (p. 451). The effects of an inaccurate prediction are intensified when
variability is present, making forecast accuracy even more important.

1.  History  of  Forecast  Accuracy  Measurement 

Over the last 50 years, researchers have invested considerable time and effort to
increase the understanding of forecast accuracy. While there is not a consensus about the
first academic article on forecast accuracy, Ferber (1956) and Schupack (1962) are
considered pioneers in this field. They tested multiple forecasting methods, using a
correlation index and various accuracy metrics to determine whether forecast models that
demonstrated a good fit to past data could then generate good forecasts. The results did
not support this hypothesis and they concluded that best fit on past data is not a good
measure of forecast accuracy. Moreover, forecast method rankings do not change much
when different forecast accuracy metrics are used, and there is no absolute best forecast method.

As computer processing capabilities grew, researchers could proceed with broader
studies to measure the accuracy of different forecast methods. Fildes and Makridakis
(1995) found that in the 20 years from 1971-1991, approximately 130 articles per year
were published in the Journal of the American Statistical Association (JASA) related to
time series analysis. For example, Newbold and Granger (1974) used average squared
forecast errors to assess the accuracy of three forecast methods, each one applied to 106
time series. A few years later, Nelson and Granger (1979) were able to analyze five
forecast methods through twenty-one time series, utilizing ten forecast horizons and two
different accuracy metrics.

Newbold  and  Granger  (1974)  were  able  to  make  insightful  conclusions  regarding 
the  use  of  non-automated  forecast  methods.  They  found  that  the  Box-Jenkins  forecast 
method  was  capable  of  making  up  for  its  significantly  longer  calculating  time  by 
producing  more  accurate  estimations.  Moreover,  results  from  that  forecasting  method 
could  be  further  improved  by  combining  them  with  other  fully  automated  procedures, 
like  Holt-Winters  or  a  stepwise  autoregressive  forecast.  They  also  provided  guidelines  to 
optimize  the  choice  of  forecast  methods  according  to  the  length  of  the  time  series.  The 
idea of combining forecast methods in order to increase accuracy is one of the most
valuable contributions in the field of forecasting. It was first investigated by Reid
(1968) and was further discussed by Nelson (1972), Cooper and Nelson (1975),
Makridakis and Winkler (1983), Nelson (1984), Clemen (1989), and Fildes (1989), among
many others.
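
The intuition behind combining forecasts can be shown with a small Python sketch. The data below are invented so that one method consistently over-forecasts and the other consistently under-forecasts; the equal-weight average and the function names are assumptions for illustration and do not reproduce the procedure of any of the cited studies.

```python
def mean_absolute_error(actuals, forecasts):
    """Average magnitude of the forecast errors."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


def combine(forecasts_a, forecasts_b, weight_a=0.5):
    """Weighted average of two forecast series; equal weights by default."""
    return [
        weight_a * fa + (1 - weight_a) * fb
        for fa, fb in zip(forecasts_a, forecasts_b)
    ]


# Hypothetical demand and two hypothetical forecasting methods.
actuals = [100, 120, 90, 110]
method_a = [110, 130, 100, 120]  # over-forecasts by 10 in every period
method_b = [94, 112, 82, 104]    # under-forecasts by 6 to 8 in every period

combined = combine(method_a, method_b)

mae_a = mean_absolute_error(actuals, method_a)
mae_b = mean_absolute_error(actuals, method_b)
mae_combined = mean_absolute_error(actuals, combined)
print(mae_a, mae_b, mae_combined)
```

Here the combined forecast has a lower error than either method alone because the two sets of errors have opposite signs and partly cancel. When the errors of the individual methods point in the same direction, averaging them cannot do better than the more accurate of the two.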

By the late 1970s, the question of which forecast method is best seemed to be
far from settled. Utilizing the increasing power of computing capabilities and
availability of new knowledge in the field of time series, Makridakis et al. (1979) and
Makridakis et al. (1982) conducted accuracy analysis on a much greater number of
forecast methods.

Moreover,  Makridakis  et  al.  (1982)  was  the  first  empirical  study  of  what  became 
known  as  the  M-1  Competition,  which  began  the  M-series  Competitions.  Makridakis  et 
al.  (1993)  and  Makridakis  and  Hibon  (2000)  published  the  M-2  and  M-3  Competitions, 
respectively,  which  attempted  to  uncover  situations  in  which  one  forecast  method  is 
expected to outperform others.

The M-1 Competition in 1982 was based on the consensus of nine authors and made
important contributions to the literature. It analyzed 24 different forecast methods using
1,001 time series and five accuracy metrics: Mean Average Percentage Error (MAPE),
Mean Squared Error (MSE), Average Ranking (AR), Medians of Absolute Percentage
Errors (MdAPE), and Percentage Better (PB). The major findings of the M-1 Competition
are that no forecast method is capable of minimizing forecast errors in all kinds of
demand patterns; more complex forecast methods do not always outperform rudimentary
ones; and the best technique changes from one forecast horizon to the next and when
different measures of accuracy are used.

The M-1 Competition also developed the categorization of time series in order to allow for the possibility that one technique performs better when specific circumstances are present. That approach is in accordance with Gilchrist (1979), who affirmed that averaging accuracy measures for several time series might hide the ability of a forecast method to deal with one specific type of time series better than others. However, one may infer that the way the time series were then grouped may have influenced the results.

Those findings were criticized by Armstrong and Lusk (1983), who identified the lack of interpretation or discussion about the results as an opportunity to open a discussion among experts aiming to clarify important aspects of forecast accuracy.

In order to address criticisms related to the organization of results, the M-2 Competition in 1993 made a simpler analysis, evaluating 16 forecast methods, each one applied to 29 time series, using just one accuracy metric, MAPE. It concluded in favor of both the exponential smoothing and the Dampened and Single smoothing methods, considered as being among the simplest. It also found that relatively sophisticated forecast methods are expected to perform better when the randomness of the series is small.

The M-3 Competition in 2000 moved back to extensive analysis: as many as 3000 time series were used to generate forecasts with 24 different methods, whose accuracy was measured by five metrics: MAPE, AR, Median Symmetric Absolute Percentage Error (MdSAPE), PB and Median of Relative Absolute Error (MdRAE). It rejected the argument that more complex methods outperform simpler ones. It found that the best method varies according to the accuracy metric used and that a combination of forecast methods is able to increase forecast accuracy.

Armstrong and Collopy (1992) presented a different approach to the use of forecast accuracy: they evaluated measures of forecast accuracy, instead of the forecast methods themselves, by using 191 economic time series. They provided a new approach to judge accuracy metrics, using a framework composed of reliability, construct validity, sensitivity to small changes, protection against outliers, and relationship to decision making. Their final conclusions were favorable to the use of MdRAE as an accuracy metric.

Following  that  discussion,  Hyndman  and  Koehler  (2006)  provide  a 
comprehensive  critical  survey  of  accuracy  measures  to  uncover  significant  inadequacy  in 
all  of  them.  They  sort  the  accuracy  metrics  into  five  categories:  scale-dependent 
measures,  measures  based  on  percentage  errors,  measures  based  on  relative  errors, 
relative  measures  and  scaled  errors;  describe  each  category  and  provide  critical  analysis 
of  their  weaknesses.  Acknowledging  inherent  flaws  of  the  existing  accuracy  metrics,  they 
propose  MASE.  The  metric  was  retroactively  applied  to  the  M-3  Competition  data  to  test 
its  potential. 

The most important findings were that MASE can be used in all patterns of demand, that it produced results in accordance with what was found by Makridakis and Hibon (2000) about best-performing methods, and that MASE represented a more powerful test than any other metric, since its results show more significant differences between forecast methods.

Finally,  after  considering  the  existing  literature,  Fildes  et  al.  (2008)  claim  that 
“establishing  an  appropriate  measure  of  forecast  error  remains  an  important  practical 
problem  for  company  forecasting”. 

2. Traditional Academic Measures of Forecast Accuracy

A starting point to discuss forecast accuracy measurement is that it is based on the observation of errors. Those errors are comparisons between the demand that was forecasted for a given period of time and the actual observation during that same time period. Therefore, the most basic idea about forecast accuracy is that a better forecast method is expected to produce smaller errors.

Furthermore, forecast accuracy can be considered a two-dimensional problem. One can think in terms of measuring accuracy over many periods of time for one item, while others may need a number that represents the goodness of a forecast method for many items in the same time period. Table 1 exemplifies the generation of forecast accuracy values in both dimensions mentioned.


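Table 1. The Two Dimensions of Forecast Accuracy

                                  |      Time 1       |      Time 2       |      Time 3       | Mean of Absolute
Items                             |  f   a  Abs Error |  f   a  Abs Error |  f   a  Abs Error | Errors per item
----------------------------------+-------------------+-------------------+-------------------+-----------------
1                                 |  7   9      2     |  1   2      1     |  0   5      5     |      2.67
2                                 |  0   9      9     |  5   7      2     |  0   3      3     |      4.67
3                                 |  4   1      3     |  5   8      3     |  4   2      2     |      2.67
4                                 |  2   0      2     |  3   8      5     |  0   3      3     |      3.33
----------------------------------+-------------------+-------------------+-------------------+-----------------
Mean of Absolute Errors in time t |         4         |       2.75        |       3.25        |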

Mean of Absolute Errors is one of the existing forecast accuracy metrics. It can be calculated either at the line item level or at the aggregated level, for each period. In this case, the forecast method used performed better for items 1 and 3 (lowest mean of absolute errors per item, 2.67), while period 2 was the time in which the overall forecast accuracy was considered the best. Considering the scale dependency of that metric, discussed later in this chapter, this hypothetical data set assumes that all line items have the same unit.
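
The two dimensions in Table 1 can be reproduced with a short Python sketch. The values are those of Table 1; the variable and function names are illustrative only.

```python
# Forecasts (f) and actual demands (a) from Table 1:
# one row per item, one column per time period.
forecasts = [
    [7, 1, 0],
    [0, 5, 0],
    [4, 5, 4],
    [2, 3, 0],
]
actuals = [
    [9, 2, 5],
    [9, 7, 3],
    [1, 8, 2],
    [0, 8, 3],
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Absolute error |f - a| for every item and period.
abs_errors = [
    [abs(f - a) for f, a in zip(f_row, a_row)]
    for f_row, a_row in zip(forecasts, actuals)
]

# First dimension: average across time, one value per item (table rows).
mae_per_item = [round(mean(row), 2) for row in abs_errors]

# Second dimension: average across items, one value per period (table columns).
mae_per_period = [round(mean(column), 2) for column in zip(*abs_errors)]

print(mae_per_item)    # [2.67, 4.67, 2.67, 3.33]
print(mae_per_period)  # [4.0, 2.75, 3.25]
```

Averaging the rows gives the per-item view; averaging the columns gives the per-period view.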


First, it is possible to isolate one time series, for example, the repeated demand for one item, and compute the accuracy over time, which Hyndman and Athanasopoulos (2014) call a type of time series cross-validation. Fildes et al. (2008) reinforce the importance of this process by claiming that forecasters should measure accuracy as a result of sequential errors.

One particular way to conduct such analysis is to calculate errors, for specific times, by comparing one-period forecasts and actual values. Afterwards, there is a variety of ways to combine errors and produce significant information about the accuracy of the forecast for that specific item.

However, Fildes et al. (2008) point out that “a common requirement, within an organization, is to provide a one-figure summary error measure, for many different time series” (p. 1158). That procedure is also known in the literature as aggregation, which is both criticized and defended by many studies, such as Jenkins (1982), Fildes and Makridakis (1995) and Hyndman and Koehler (2006).

In  order  to  enable  aggregation,  Fildes  and  Makridakis  (1995)  affirm  that  errors 
must  be  standardized.  In  fact,  Hyndman  and  Koehler  (2006)  applied  scaled  errors  as  a 
form  of  standardization,  thus  enabling  aggregation  by  simple  average. 

Therefore,  we  infer  that  an  effective  measure  of  accuracy  should  be  able  to 
produce  results  for  both  dimensions.  However,  as  we  could  not  find  any  further 
discussion  about  the  best  way  to  aggregate  accuracy  values,  hereafter,  we  are  going  to 
discuss  a  variety  of  metrics  used  to  calculate  forecast  accuracy  across  time,  which  are 
exhaustively  discussed  in  literature  and  often  used  by  organizations. 

To  do  so,  we  are  going  to  present  the  most  common  accuracy  metrics  using  the 
same  taxonomy  found  in  Hyndman  and  Koehler  (2006).  Basically,  we  review  the  many 
possibilities  of  handling  the  error,  which  is  calculated  as: 

$e_t = f_t - a_t$  (2.1)

where:

$e_t$ = forecast error at a given time

$f_t$ = forecast value at a given time

$a_t$ = actual value at a given time

a.  Scale-Dependent  Metrics 

Metrics that fall in this category generate values accompanied by their respective units. Their use has to be restricted to series cross-validation in order to avoid the problem of mixing units of different items. That is the main source of criticism of the M-1 Competition, in Makridakis et al. (1982), since it inappropriately uses the MSE across time series.

The most common scale-dependent measures are:

Mean Squared Error

$\mathrm{MSE} = \mathrm{mean}(e_t^2)$  (2.2)

Root Mean Squared Error

$\mathrm{RMSE} = \sqrt{\mathrm{mean}(e_t^2)}$  (2.3)

Mean Absolute Error

$\mathrm{MAE} = \mathrm{mean}(|e_t|)$  (2.4)

Median Absolute Error

$\mathrm{MdAE} = \mathrm{median}(|e_t|)$  (2.5)

All equations in this category use central tendency measures. It is worth noting that means and medians are extreme opposites in terms of sensitivity to outliers. Hence, large errors will dominate the results in formulas based on means and cause almost no change in the results of formulas based on medians. Therefore, in both cases the quality of the results is harmed.

Additionally,  measures  that  use  squared  errors  have  the  potential  to  penalize  large 
deviations,  in  comparison  to  small  ones,  which  make  them  appear  attractive  to  some 
managers.  However,  their  use  was  tested  and  not  recommended  by  Armstrong  and 
Collopy  (1992)  and  Armstrong  (2001),  due  to  the  disproportional  harm  caused  by 
outliers. 
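
The contrast between mean-based and median-based measures can be illustrated with a minimal Python sketch. The error values below are hypothetical and chosen only to show the effect of a single outlier on the four scale-dependent metrics.

```python
import math
import statistics


def mse(errors):
    """Mean Squared Error, Equation (2.2)."""
    return statistics.mean([e ** 2 for e in errors])


def rmse(errors):
    """Root Mean Squared Error, Equation (2.3)."""
    return math.sqrt(mse(errors))


def mae(errors):
    """Mean Absolute Error, Equation (2.4)."""
    return statistics.mean([abs(e) for e in errors])


def mdae(errors):
    """Median Absolute Error, Equation (2.5)."""
    return statistics.median([abs(e) for e in errors])


# Hypothetical forecast errors e_t = f_t - a_t for one item.
errors = [2, -1, 3, -2, 1]
# Same series with the last error replaced by an outlier.
errors_with_outlier = [2, -1, 3, -2, 40]

for name, metric in [("MSE", mse), ("RMSE", rmse), ("MAE", mae), ("MdAE", mdae)]:
    print(name, metric(errors), metric(errors_with_outlier))
```

In this example the outlier raises MAE from 1.8 to 9.6 and MSE from 3.8 to 323.6, while MdAE stays at 2 in both cases.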

b.  Percentage  Errors  Metrics 

Hyndman and Koehler (2006) define percentage error ($p_t$) by the following equation:

$p_t = e_t / a_t$  (2.6)

Means, medians and squares are applied to $p_t$ to derive new forecast accuracy metrics. The most common percentage error measures found in the literature are:

Mean of Absolute Percentage Error

$\mathrm{MAPE} = \mathrm{mean}(|p_t|)$  (2.7)

Median of Absolute Percentage Error

$\mathrm{MdAPE} = \mathrm{median}(|p_t|)$  (2.8)

Root Mean Square Percentage Error

$\mathrm{RMSPE} = \sqrt{\mathrm{mean}(p_t^2)}$  (2.9)

Root Median Square Percentage Error

$\mathrm{RMdSPE} = \sqrt{\mathrm{median}(p_t^2)}$  (2.10)

An inherent flaw with percentage error ($p_t$) is that it produces an infinite result when $a_t = 0$. Therefore, none of these metrics is recommended for data sets that contain actual demand values equal to zero.

Additionally, Tayman and Swanson (1999) state that “MAPE does not meet the criterion of validity, as it systematically overstates the average error of estimates, therefore, harming the degree of correspondence between its measures and actual values” (p. 299).

Furthermore, Makridakis et al. (1993) noticed that these metrics also penalize positive and negative errors differently because negative errors ($e_t < 0$), in terms of inventory, are limited to the amount of the actual value ($a_t$), while positive errors ($e_t > 0$) are unbounded. In order to deal with that, symmetric measures were defined:

Symmetric Mean Absolute Percentage Error

$\mathrm{sMAPE} = \mathrm{mean}\left(200\,|a_t - f_t| / (a_t + f_t)\right)$  (2.11)

Symmetric Median Absolute Percentage Error

$\mathrm{sMdAPE} = \mathrm{median}\left(200\,|a_t - f_t| / (a_t + f_t)\right)$  (2.12)

However, while Hyndman and Koehler (2006) found that these metrics reduced the unwanted effects caused by small actual demand values, they did not completely solve the problem. Moreover, some studies showed that these metrics are not as symmetric as they were supposed to be (Goodwin and Lawton, 1999; Koehler, 2001).
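
The behavior of percentage error metrics around zero demand can be sketched in Python as follows. The data are hypothetical, the percentage error is expressed in percent (an assumption made here for readability), and the division by zero is reported as infinity rather than raised as an exception.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def percentage_error(forecast, actual):
    """p_t = e_t / a_t with e_t = f_t - a_t, expressed in percent."""
    if actual == 0:
        # Division by zero: reported as infinite, as described in the text.
        return math.inf
    return 100.0 * (forecast - actual) / actual


def mape(forecasts, actuals):
    return mean(abs(percentage_error(f, a)) for f, a in zip(forecasts, actuals))


def smape(forecasts, actuals):
    """Equation (2.11); assumes a_t + f_t > 0 in every period."""
    return mean(200.0 * abs(a - f) / (a + f) for f, a in zip(forecasts, actuals))


forecasts = [8, 12, 5]
actuals = [10, 10, 10]
actuals_with_zero = [10, 0, 10]

print(mape(forecasts, actuals))             # 30.0
print(mape(forecasts, actuals_with_zero))   # inf
print(smape(forecasts, actuals_with_zero))  # finite, but one term reaches 200

# Asymmetry: under-forecasts are bounded at -100%, over-forecasts are not.
print(percentage_error(0, 10), percentage_error(50, 10))  # -100.0 400.0
```

A single zero in the actual demand makes MAPE infinite, whereas sMAPE stays finite but assigns that period its maximum value of 200.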

c. Relative Error Metrics

These metrics are based on the division of an error produced by one forecast method by the error of another forecast method, which serves as a benchmark method. Often, the benchmark forecast method consists of just a replication of the previous period's values, which Hyndman and Koehler (2006) define as a random walk. That procedure is also known in the literature as the naive method (Makridakis et al., 1993). Hence, relative error ($r_t$) is expressed by the following equation:

$r_t = e_t / e_t^*$  (2.13)

where $e_t^*$ is the error produced by the benchmark method at time t.

The most common relative error measures are:

Mean Relative Absolute Error

$\mathrm{MRAE} = \mathrm{mean}(|r_t|)$  (2.14)

Median Relative Absolute Error

$\mathrm{MdRAE} = \mathrm{median}(|r_t|)$  (2.15)

Geometric Mean Relative Absolute Error

$\mathrm{GMRAE} = \mathrm{gmean}(|r_t|)$  (2.16)

Scrutinizing the relative error equation, we found that it is inherently flawed in two cases: when the error produced by the benchmark method is zero, the relative error becomes infinite, and when benchmark errors are very small, they induce extremely high relative errors.

Regarding  that  issue,  Armstrong  and  Collopy  (1992)  proposed  a  particular  way  to 
soften  the  mentioned  effect  by  trimming  results,  the  so-called  Winsorizing.  Basically, 
they  attributed  fixed  values  when  benchmark  errors  are  under  or  above  certain  thresholds. 
According  to  Hyndman  and  Koehler  (2006),  this  procedure  increases  complexity  and 
inserts  arbitrariness. 
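
The relative error measures and their weakness can be sketched in Python as follows. The series are hypothetical, the benchmark is the naive method, and a zero benchmark error is reported as infinity, which is a choice made for this illustration only.

```python
import math
import statistics


def relative_errors(forecasts, actuals):
    """r_t = e_t / e*_t, where e*_t is the error of the naive benchmark.

    The naive forecast for period t is the actual value of period t-1,
    so the first period is skipped.
    """
    ratios = []
    for t in range(1, len(actuals)):
        error = forecasts[t] - actuals[t]
        benchmark_error = actuals[t - 1] - actuals[t]
        if benchmark_error == 0:
            # Division by zero: treated as infinite in this sketch.
            ratios.append(math.inf)
        else:
            ratios.append(error / benchmark_error)
    return ratios


def mrae(ratios):
    return sum(abs(r) for r in ratios) / len(ratios)


def mdrae(ratios):
    return statistics.median([abs(r) for r in ratios])


def gmrae(ratios):
    # Requires finite, strictly positive |r_t| values.
    return statistics.geometric_mean([abs(r) for r in ratios])


forecasts = [9, 11, 13, 13, 15]
actuals = [10, 12, 11, 15, 14]
# Same demand, except that periods 2 and 3 are equal (naive error of zero).
actuals_flat = [10, 12, 12, 15, 14]

ratios = relative_errors(forecasts, actuals)
print(mrae(ratios), mdrae(ratios), gmrae(ratios))

ratios_flat = relative_errors(forecasts, actuals_flat)
print(mrae(ratios_flat), mdrae(ratios_flat))
```

With the second series, one repeated demand value is enough to make MRAE infinite, while the median-based MdRAE remains finite.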

d. Relative Metrics

Instead of simply dividing errors, these metrics are based on dividing the results of one accuracy metric computed from the errors produced by different forecast methods. Therefore, Relative Mean Absolute Error is the division of the MAE generated by one forecast method by the MAE generated by a second method. Following are some of the possible metrics:

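Relative Mean Absolute Error

$\mathrm{RMAE} = \mathrm{MAE}_1 / \mathrm{MAE}_2$  (2.17)

Relative Root of Mean Squared Error

$\mathrm{RRMSE} = \mathrm{RMSE}_1 / \mathrm{RMSE}_2$  (2.18)

Relative Median Absolute Error

$\mathrm{RMdAE} = \mathrm{MdAE}_1 / \mathrm{MdAE}_2$  (2.19)

Relative Mean Absolute Percentage Error

$\mathrm{RMAPE} = \mathrm{MAPE}_1 / \mathrm{MAPE}_2$  (2.20)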

As the name of this group of metrics suggests, the results are given in relation to another forecast method. Hence, values from zero to one mean a better forecast compared to the benchmark forecast method. When the result is one, there is no difference in accuracy between the forecast methods considered. Results bigger than one mean that the forecast method used performed worse than the benchmark. Hyndman and Koehler (2006) consider the characteristic of easy interpretability an advantage of these metrics.

The only limitation found is that it is impossible to use these metrics across items, regarding just one period in time, since they use scale-dependent measures in the numerator and denominator that do not allow aggregation of different time series.

Wheelwright et al. (1998) mention a specific relative metric, called Theil's U Statistic, and its variation, Theil's U-2 Statistic. Theil developed the first of those metrics in 1966, and it was modified into the second one in 1978. They claim that Theil's U-2 statistic is just a particular case of RMAE, when the benchmark method is the naive method and forecasts are generated one period ahead.

Another metric that uses the same principle of relative measures is PB. It is the percentage of times that one forecast method performs better than another, using any of the mentioned accuracy measures. Hyndman and Koehler (2006) mention two disadvantages of this metric. First, it is not sensitive to the size of errors and second, it does not provide a clear idea of how much improvement is possible.
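
A minimal Python sketch of a relative metric and of PB follows. The error series are hypothetical, and PB is computed here on absolute errors per period, which is one of several possible choices.

```python
def mean_abs(errors):
    return sum(abs(e) for e in errors) / len(errors)


def relative_mae(method_errors, benchmark_errors):
    """MAE of the evaluated method divided by the MAE of the benchmark."""
    return mean_abs(method_errors) / mean_abs(benchmark_errors)


def percentage_better(method_errors, benchmark_errors):
    """Share of periods (in percent) in which the method beats the benchmark."""
    wins = sum(
        1 for e, b in zip(method_errors, benchmark_errors) if abs(e) < abs(b)
    )
    return 100.0 * wins / len(method_errors)


method_errors = [-1, 2, -2, 1]
benchmark_errors = [-2, 1, -4, 1]
# Same benchmark, but with one much larger miss in the third period.
benchmark_large_miss = [-2, 1, -40, 1]

print(relative_mae(method_errors, benchmark_errors))       # 0.75
print(percentage_better(method_errors, benchmark_errors))  # 50.0
print(relative_mae(method_errors, benchmark_large_miss))
print(percentage_better(method_errors, benchmark_large_miss))  # 50.0
```

The relative MAE drops sharply when the benchmark suffers a large miss, while PB stays at 50% because it only counts wins and ignores the size of the errors.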


e.  Scaled  Error  Metric 

Hyndman and Koehler (2006) developed a new metric based on the principles of Relative Error Metrics and Relative Metrics. The rationale is to solve existing problems in the mentioned metrics by dealing with scaled errors ($q_t$). The scaling factor, the denominator of the scaled error, is the MAE of the in-sample values of a benchmark forecast method.


The scaled error is defined by the following equation:

$q_t = \dfrac{e_t}{\dfrac{1}{k} \sum_{j=1}^{k} |a_j - f_j|}$  (2.21)

where:

j = sample time index

k = time index of the last in-sample observation

Hence, the error measured at a given time ($e_t$) is divided by the MAE of a benchmark forecast method, considering only the in-sample time period.

Hyndman and Koehler (2006) propose a particular type of scaled error, in which the benchmark is the naive method. Because of that, the identity $f_j = a_{j-1}$ can be applied to adjust the equation. Moreover, they assume that the in-sample data comprises periods from 1 to k. That makes the difference $a_j - f_j$ applicable from period 2 to k, as the first possible $f_j$ value uses the $a_1$ value. As a result, there are k-1 observations to be considered in the denominator of $q_t$.


Applying the mentioned adjustments, the following equation results:

$q_t = \dfrac{e_t}{\dfrac{1}{k-1} \sum_{j=2}^{k} |a_j - a_{j-1}|}$  (2.22)

After that, the Mean of Absolute Scaled Error is just given by:

$\mathrm{MASE} = \mathrm{mean}(|q_t|)$  (2.23)

The interpretation of results follows the same guidance given for relative measures. The only case in which the scaled error equations do not work is when all in-sample naive errors equal zero, that is, when all in-sample values are identical. We were also not able to find any negative critiques of this metric in the literature. For these reasons, we chose MASE as the metric to compare against the accuracy metric proposed in the CIMIP. Additionally, in Chapter III, we present a further discussion on the importance of using benchmarks when measuring forecast accuracy.
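
The MASE calculation with the naive method as benchmark can be sketched in Python as follows. The demand history and forecasts are hypothetical, and the function name and argument layout are illustrative choices.

```python
def mase(in_sample_actuals, forecasts, actuals):
    """Mean of Absolute Scaled Errors with the naive method as benchmark.

    The scale is the in-sample MAE of the naive forecast, that is, the mean
    of |a_j - a_(j-1)| for j = 2..k, which has k-1 terms.
    """
    k = len(in_sample_actuals)
    if k < 2:
        raise ValueError("at least two in-sample observations are required")
    naive_abs_errors = [
        abs(in_sample_actuals[j] - in_sample_actuals[j - 1]) for j in range(1, k)
    ]
    scale = sum(naive_abs_errors) / (k - 1)
    if scale == 0:
        raise ValueError("all in-sample naive errors are zero; MASE is undefined")
    scaled = [abs(f - a) / scale for f, a in zip(forecasts, actuals)]
    return sum(scaled) / len(scaled)


history = [10, 12, 11, 15, 14]  # in-sample demand, periods 1..k
future_actuals = [13, 16]       # demand observed after period k
future_forecasts = [14, 15]     # forecasts for those periods

print(mase(history, future_forecasts, future_actuals))  # 0.5

# The same item measured in a unit 100 times smaller gives the same result.
print(mase(
    [100 * a for a in history],
    [100 * f for f in future_forecasts],
    [100 * a for a in future_actuals],
))  # 0.5
```

Because each error is divided by a scale taken from the same series, the result does not depend on the unit of measure, which is what allows averaging MASE values across items.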


3. Forecast Accuracy Metrics Currently Used in the Defense Environment

As part of CIMIP implementation, the Office of the Secretary of Defense (OSD) established two metrics to measure forecast accuracy and forecast bias, while components already had their own ways to keep track of the goodness of their forecasts. This section aims to introduce the equations used by DOD and Navy, presenting brief comments about their main features.


a. DOD’s Forecast Accuracy Metrics

The  challenge  with  a  common  metric  that  is  self-reported  is  to  ensure  that  each 
group  is  calculating  the  metric  correctly.  To  address  this  issue,  the  DOD  published 
internal  business  rules  to  standardize  the  reporting  effort  among  the  components  (DOD, 
2013).  As  mentioned  in  Chapter  I  of  our  research,  the  CIMIP  metric  required  specific 
data  elements  of  the  forecast  and  demand  history,  yet  these  business  rules  also  detail  what 
data  should  not  be  included.  As  stated  in  the  introduction  to  the  business  rules  document, 
the  CIMIP  “forecasting  metrics  are  not  the  mechanism  to  reduce  error;  however  the 
metrics  will  create  a  common  baseline  from  which  to  measure  the  impact  of  other 
initiatives” (DOD, 2013, p. 2). The results of these forecast accuracy and bias metrics are to be reported semi-annually at the DOD’s inventory management reviews, as well as monitored by the CIMIP forecasting, total asset visibility, multi-echelon modeling working group and the supply chain metrics group (DOD, 2013).

The  components  are  responsible  for  collecting  all  of  the  data  necessary  to 
compute  the  metrics,  which  should  include  all  items  for  which  the  components  use  some 
type  of  forecast  algorithm.  This  excludes  items  whose  requirements  determination  is 
impacted  by  component  business  rules,  performance-based  contracts  and  foreign  military 
sales.  The  metric  also  excludes  unforecastable  items,  which  either  do  not  have  a  demand 
forecast  rate,  or  whose  forecast  and  actual  demand  during  the  reporting  period  is  equal  to 
zero.  Although  the  components  are  free  to  generate  forecasts  with  the  method  and  time 
horizon  of  their  choosing,  they  are  required  to  insert  12-month  forecasts  and  actual 
demands  in  the  calculations. 

The implementation of standard metrics to assess forecast accuracy and bias is one of the required actions, contained in CIMIP, to address the DOD need for better forecasts. From this point on, we refer to those metrics as CIMIPf, the aggregated forecast accuracy obtained at a given period of time, and CIMIPb, the forecast bias, as follows:

$\mathrm{CIMIP}_f = \left(1 - \dfrac{\sum_{i=1}^{n} c_i\,|f_i - a_i|}{\sum_{i=1}^{n} c_i\,a_i}\right) \times 100\%$  (2.24)

$\mathrm{CIMIP}_b = \left(\cdots\right) \times 100\%$  (2.25)

where:

n = number of items in the forecast dataset

$c_i$ = unit cost for item i

$f_i$ = demand forecast for item i

$a_i$ = actual demand for item i


A close look at the CIMIPf metric reveals a certain similarity to MAPE, Equation (2.7). The first notable difference is the one minus before the fraction. It implies the rationale that accuracy is better when error is small and does not represent any harm to the interpretation of results. Another important difference is that CIMIPf is a division of summations, instead of a summation of divisions. Additionally, we assume that CIMIPf, as an inventory forecast accuracy metric, uses unit costs to weight the importance of expensive items within the dataset and not as an evaluation of budget impacts.

As mentioned in the introduction, the accuracy metrics contained in CIMIP are the central issue of this research. Therefore, careful discussion and evaluation of those characteristics are presented in Chapter III.

b. Navy’s Forecast Accuracy Metric

GAO criticized NAVSUP’s secondary inventory management and recommended that it “evaluate and improve demand forecasting procedures,” (GAO, 2008, p. 5). Then, a NAVSUP team developed the Lead-time Adjusted Symmetric Error (LASE), as their demand forecast accuracy metric, more than a year prior to the release of the CIMIP forecast accuracy metrics (Bencomo, 2010).

After determining that traditional accuracy measurements, such as MSE and MAPE, were insufficient, they combined two proposed solutions for calculating percentage error for intermittent demand: sMAPE and Denominator-Adjusted MAPE (DAM) (Hoover, 2006).

The advertised benefits of the LASE metric were that it is capable of providing results with demand data that is highly intermittent, it does not generate a division-by-zero error, and it returns a symmetrical assessment of over and under forecasting. The equation follows:

$\mathrm{LASE} = \dfrac{|f_t - a_t|}{[(f_t + a_t)/2] + 1}$  (2.26)

Actually, the LASE equation is a combination of two aspects present in Hoover (2006). The first is that sMAPE, Equation (2.11), is a good way to measure forecast accuracy when forecast or actual demand is different from zero. The second is that when forecast and demand are both zero, managers should adjust the denominator by adding one. However, instead of applying the denominator adjustment only in cases in which forecast and actual demand are both zero, the LASE metric applies the adjustment as a general rule. This characteristic aims to ensure consistency, as opposed to the use of different criteria for different items.


The following equation, LASE', is a version of the LASE equation that is more consistent with the one proposed in Hoover (2006):

$\mathrm{LASE}' = \dfrac{|f_t - a_t|}{I \cdot [(f_t + a_t)/2] + J}$  (2.27)

where:

if $f_t + a_t = 0$, then I = 0 and J = 1;

if $f_t + a_t \neq 0$, then I = 1 and J = 0.

However, we consider the complexity of LASE' a drawback, as well as its lack of criteria consistency, as different items are subjected to different rules.

One year after the metric was released, Jackson (2011) demonstrated that the LASE metric had an inherent smoothing effect that hampers the identification of large divergences between forecast methods. By the end of the study, he recommended against its use. Despite that, NAVSUP continues to utilize the LASE metric as an internal managerial tool to measure forecast accuracy.
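
The difference between applying the denominator adjustment as a general rule and applying it only when forecast and demand are both zero can be sketched in Python. The formulas below are assumptions derived from the verbal description in this section (an sMAPE-style ratio with one added to the denominator), not a verified statement of the NAVSUP implementation, and the function names are hypothetical.

```python
def symmetric_term(forecast, actual):
    """Per-item symmetric ratio |f - a| / ((f + a) / 2); undefined when f + a = 0."""
    return abs(forecast - actual) / ((forecast + actual) / 2)


def lase_general_rule(forecast, actual):
    """Assumed form: one is always added to the denominator."""
    return abs(forecast - actual) / ((forecast + actual) / 2 + 1)


def lase_conditional(forecast, actual):
    """Assumed variant: the adjustment applies only when f + a = 0."""
    if forecast + actual == 0:
        # Both values are zero, so the numerator is zero and the result is 0.
        return abs(forecast - actual) / 1
    return symmetric_term(forecast, actual)


for f, a in [(0, 0), (1, 0), (10, 0), (1000, 0), (12, 10)]:
    print(f, a, round(lase_general_rule(f, a), 3), round(lase_conditional(f, a), 3))
```

Under these assumed forms neither function divides by zero. The general-rule version stays below 2 for non-negative inputs and compresses small-volume errors: a forecast of 1 against a demand of 0 scores about 0.667 instead of 2.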


C.  CHAPTER  SUMMARY 

The forecast accuracy field of research has significantly evolved during the last sixty years, following the evolution of computing capabilities. Massive analyses and deep considerations in the literature provide relevant findings. From those, we highlight the following as the key learning points of this chapter:

• There is no absolute best forecast method.

• More complex forecast methods do not always improve accuracy.

• Combining forecast methods will likely result in more accurate forecasts.

• Forecast accuracy can be measured across two dimensions: the first is time and the second is line items.

• Scale-dependent metrics do not allow aggregation of results.

• Percentage error metrics are vulnerable to zero actual demand.

• Relative error metrics and relative metrics are vulnerable to the occurrence of any zero error.

• MASE avoids the flaws of many traditional metrics and remains in good standing among academic literature reviews.

Separate from the evolutionary process of academic literature on forecast accuracy, the DOD and Navy developed their own forecast accuracy metrics, respectively CIMIPf and LASE, in an attempt to quantify and improve their forecasting efforts.

III. ANALYSES ON CIMIP FORECAST ACCURACY METRIC


A. INTRODUCTION

This chapter will examine whether the current DOD forecast accuracy metric has any inherent flaws and if there are any alternative forecast accuracy metrics that avoid these flaws and produce higher quality, more relevant results.

B. EVALUATION OF CURRENT METRIC

At first glance, the CIMIPf metric, Equation (2.24), appears to be similar to MAPE, Equation (2.7), which as we discussed in Chapter II is a traditional forecast accuracy metric. The main difference between the two metrics is that MAPE is a summation of divisions, while CIMIPf is a division of summations that includes unit costs as a way to convert values to a common unit of measurement and prioritize the forecast performance of expensive items.

While MAPE is a broadly studied, traditional metric, it contains specific flaws that limit the scope of its applicability. In this section, we will investigate whether those differences, along with other specific characteristics, make CIMIPf a recommendable managerial tool to assess forecast accuracy.

1.  Division  of  Summations 

One of the main objectives of any forecast accuracy metric that utilizes division of a numerator by a denominator is to avoid unit-of-measure dependence in order to enable aggregation of results across a range of products. CIMIPf, on the other hand, aggregates the results into dollars, by including unit costs in both the numerator and denominator before the division occurs. This division of the total forecast error in dollars by the total actual demand in dollars produces a scale-free, dollar-weighted result.

To illustrate the methodological difference, we compare the CIMIPf metric to a cross-sectional extension of MAPE, in the manner that they determine their results. The equation for that variation of MAPE is:

$\mathrm{MAPE}_f = \mathrm{mean}(|p_i|)$  (3.1)

where:

$p_i = e_i / a_i$; and

$e_i = f_i - a_i$

The MAPEf calculation first obtains the absolute percentage errors $|p_i|$ at the item level, then the individual results are averaged. CIMIPf first converts the numerator and denominator for each item into dollars, sums the numerators and denominators separately, and then divides one by the other to generate a forecast accuracy result that represents the entire population. In this example we have adjusted MAPE to the aggregated level to enable comparison, yet we could have adjusted CIMIPf to the individual level to accomplish the same. Later, Equation (3.2) will present this extension of CIMIPf. Table 2 and Table 3 provide an example of the methodological distinction.


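Table 2. MAPE Calculation

Items |   f_i   | a_i |  e_i   | |p_i|
------+---------+-----+--------+--------
1     |  23.84  |  32 |  -8.16 |  25.5%
2     |  21.26  |  20 |   1.26 |   6.3%
3     |   0     |   2 |  -2    |   100%
4     | 235.42  | 151 |  84.42 |  55.9%
------+---------+-----+--------+--------
MAPEf |         |     |        | 46.93%

The far right column shows how MAPE first calculates individual absolute percentage errors and then averages them to get the final value.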


Recalling the CIMIPf metric:

Equation (2.24):

$$\mathrm{CIMIP}_f = \left(1 - \frac{\sum_{i=1}^{n} c_i \lvert f_i - a_i \rvert}{\sum_{i=1}^{n} c_i a_i}\right) \times 100\%$$

Table 3. CIMIPf Calculation

Items   f_i      a_i   c_i             |f_i - a_i|   c_i * a_i        c_i * |f_i - a_i|
1       23.84    32    $1,354,173.00   8.16          $43,333,536.00   $11,050,051.68
2       21.26    20    $43,125.00      1.26          $862,500.00      $54,337.50
3       0        2     $32,815.00      2             $65,630.00       $65,630.00
4       235.42   151   $260,000.00     84.42         $39,260,000.00   $21,949,200.00
Sum                                                  $83,521,666.00   $33,119,219.18
CIMIPf                                                                60%

The two far right columns of Table 3 demonstrate how CIMIPf sums the numerator (total dollar error) and denominator (total dollar demand) separately before dividing them, subtracting the quotient from one, and then multiplying by 100 to generate the final CIMIPf value.
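The methodological distinction between the two metrics can be sketched in a few lines of Python. This is a minimal illustration, assuming the item values printed in Table 2 and Table 3; the function and variable names are ours and do not come from the DOD guidance.

```python
# Item data copied from Table 2 and Table 3: forecast, actual demand, unit cost.
forecasts = [23.84, 21.26, 0.0, 235.42]
actuals = [32.0, 20.0, 2.0, 151.0]
unit_costs = [1_354_173.00, 43_125.00, 32_815.00, 260_000.00]


def mape(forecasts, actuals):
    """Average the item-level absolute percentage errors, as in Equation (3.1)."""
    ratios = [abs(f - a) / a for f, a in zip(forecasts, actuals)]
    return 100.0 * sum(ratios) / len(ratios)


def cimip(forecasts, actuals, unit_costs):
    """One minus total dollar error over total dollar demand, as in Equation (2.24)."""
    dollar_error = sum(c * abs(f - a) for f, a, c in zip(forecasts, actuals, unit_costs))
    dollar_demand = sum(c * a for a, c in zip(actuals, unit_costs))
    return 100.0 * (1.0 - dollar_error / dollar_demand)


mape_result = mape(forecasts, actuals)
cimip_result = cimip(forecasts, actuals, unit_costs)
print(f"MAPEf  = {mape_result:.2f}%")   # mean of the four item-level ratios
print(f"CIMIPf = {cimip_result:.2f}%")  # ratio of the two dollar sums
```

MAPE divides first and averages afterwards, while CIMIPf sums first and divides once, which is why the latter needs a common unit (dollars) for every item.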


Moreover, as mentioned in Chapter II, MAPE does not generate a solution at the item level when actual demand is zero. This division-by-zero error makes it impossible to compute an average unless those non-solutions are ignored, which in turn degrades the entire accuracy measurement.

Meanwhile, the CIMIPf metric avoids that effect by applying a summation in the denominator, which accounts for the fact that the data can include items with zero demand. Thus, the CIMIPf metric is able to produce valid results even when the data set contains values of zero for either the actual demand or the forecast of individual line items.

Therefore, we claim that the CIMIPf metric is more robust than MAPE. The only case in which the CIMIPf equation does not produce a valid result is when the actual demands of all items considered are zero. Table 4 aims to provide evidence of the superiority of CIMIPf, in terms of robustness, when compared to MAPEf.

Table 4. Test of Relative Robustness of CIMIPf Compared to MAPEf

Items   f_i      a_i   c_i             |f_i - a_i|   c_i * a_i        c_i * |f_i - a_i|   |f_i - a_i| / a_i
1       23.84    32    $1,354,173.00   8.16          $43,333,536.00   $11,050,051.68      25.5%
2       21.26    0     $43,125.00      21.26         $0.00            $916,837.50         ∞
3       0        2     $32,815.00      2             $65,630.00       $65,630.00          100%
4       235.42   151   $260,000.00     84.42         $39,260,000.00   $21,949,200.00      55.9%
Sum                                                  $82,659,166.00   $33,981,719.18
CIMIPf                                                                59%
MAPEf                                                                                     ∞

In this case, the actual demand of item 2 is zero, which invalidates the entire MAPE calculation, while CIMIPf still produces a valid result. This supports the Hyndman & Koehler (2006) recommendation that MAPE should not be used on data sets that contain actual demands of zero.
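The robustness test in Table 4 can be reproduced with a short sketch. It assumes the values printed in the table; representing the undefined item-level ratio as infinity is a choice made for this illustration (it mirrors the table), not a rule taken from the source.

```python
import math

# Item data copied from Table 4; item 2 has zero actual demand.
forecasts = [23.84, 21.26, 0.0, 235.42]
actuals = [32.0, 0.0, 2.0, 151.0]
unit_costs = [1_354_173.00, 43_125.00, 32_815.00, 260_000.00]


def mape(forecasts, actuals):
    """Mean absolute percentage error; infinite if any actual demand is zero."""
    ratios = []
    for f, a in zip(forecasts, actuals):
        # Assumption for this sketch: a zero denominator is reported as infinity.
        ratios.append(math.inf if a == 0 else abs(f - a) / a)
    return 100.0 * sum(ratios) / len(ratios)


def cimip(forecasts, actuals, unit_costs):
    """One minus total dollar error over total dollar demand."""
    dollar_error = sum(c * abs(f - a) for f, a, c in zip(forecasts, actuals, unit_costs))
    dollar_demand = sum(c * a for a, c in zip(actuals, unit_costs))
    return 100.0 * (1.0 - dollar_error / dollar_demand)


mape_result = mape(forecasts, actuals)
cimip_result = cimip(forecasts, actuals, unit_costs)
print(f"MAPEf  = {mape_result}")        # a single zero demand spoils the average
print(f"CIMIPf = {cimip_result:.0f}%")  # the summed denominator stays positive
```

The CIMIPf function would only fail if every actual demand were zero, since that is the only way the summed denominator can reach zero.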


2.  The  Role  of  Unit  Costs 

As mentioned, CIMIPf is calculated differently than most traditional forecast accuracy metrics, as it implies that summations of forecast errors and actual demand values have to be made before the division, thus requiring the input data to be in the same unit-of-measure. In that context, unit costs are used as a means to standardize the units-of-measure of each item's demand, allowing the summations to occur in both the numerator and denominator.

In addition, the inclusion of unit cost also provides a weighting mechanism that prioritizes the accuracy of more expensive items over less expensive items. In the literature we reviewed, there is no mention of the use of weightings by the forecast accuracy metrics. All traditional equations are calculated around the forecast error, Equation (2.1), considering just two independent variables: forecast values and actual demands. The introduction of another independent variable such as unit cost, in the case of CIMIPf, may affect the results. While measuring forecast demand error in dollars is a workable metric, the stated goal of CIMIPf is to produce a percentage measure of forecast accuracy.

Another point against the use of unit costs is that a secondary objective of the CIMIPf metric is to avoid excess inventory and the related costs. One might think that organizations must avoid excess inventory of high unit cost items to reduce unwanted financial impacts. However, total inventory cost is composed of holding, transportation, handling, acquisition and shortage costs. Of these five costs, only holding cost is directly affected by unit costs, and although positive correlations between unit costs and transportation, handling and shortage costs are possible, they are not certain. While cost is important to prioritize forecasting efforts, other factors such as criticality and interchangeability could also be considered. Acknowledging that unit cost is not the main driver of total inventory cost or prioritization, we infer that forecast accuracy should be measured as a function of forecast and actual demand values.

To weigh the advantages and disadvantages of using unit cost in the equation, we need to test to what extent it can significantly affect the interpretation of forecast accuracy. To do this, we built a test composed of four data sets, Table 5 through Table 8, that keep forecast and demand values constant while allowing the unit costs to vary:


Table 5. Test of Cost Impact on CIMIPf - Data Set 1

Items   f_i   a_i   c_i         |f_i - a_i|   c_i * a_i     c_i * |f_i - a_i|
1       90    100   $1,000.00   10            $100,000.00   $10,000.00
2       30    100   $50.00      70            $5,000.00     $3,500.00
3       50    100   $20.00      50            $2,000.00     $1,000.00
4       80    100   $250.00     20            $25,000.00    $5,000.00
Sum                                           $132,000.00   $19,500.00
CIMIPf                                                      85%

Table 6. Test of Cost Impact on CIMIPf - Data Set 2

Items   f_i   a_i   c_i         |f_i - a_i|   c_i * a_i     c_i * |f_i - a_i|
1       90    100   $50.00      10            $5,000.00     $500.00
2       30    100   $1,000.00   70            $100,000.00   $70,000.00
3       50    100   $20.00      50            $2,000.00     $1,000.00
4       80    100   $250.00     20            $25,000.00    $5,000.00
Sum                                           $132,000.00   $76,500.00
CIMIPf                                                      42%

Table 7. Test of Cost Impact on CIMIPf - Data Set 3

Items   f_i   a_i   c_i         |f_i - a_i|   c_i * a_i     c_i * |f_i - a_i|
1       90    100   $20.00      10            $2,000.00     $200.00
2       30    100   $50.00      70            $5,000.00     $3,500.00
3       50    100   $1,000.00   50            $100,000.00   $50,000.00
4       80    100   $250.00     20            $25,000.00    $5,000.00
Sum                                           $132,000.00   $58,700.00
CIMIPf                                                      56%

Table 8. Test of Cost Impact on CIMIPf - Data Set 4

Items   f_i   a_i   c_i         |f_i - a_i|   c_i * a_i     c_i * |f_i - a_i|
1       90    100   $250.00     10            $25,000.00    $2,500.00
2       30    100   $50.00      70            $5,000.00     $3,500.00
3       50    100   $20.00      50            $2,000.00     $1,000.00
4       80    100   $1,000.00   20            $100,000.00   $20,000.00
Sum                                           $132,000.00   $27,000.00
CIMIPf                                                      80%

CIMIPf results ranged from 42% to 85%, which may lead to diverse interpretations of forecast accuracy.


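The results of this test demonstrate that the presence of unit cost in the CIMIPf metric harms the quality of the item demand forecast accuracy measurement: identical forecasts and demands produced results from 42% to 85% solely because the unit costs were rearranged.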
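The unit cost test can be replayed with the following sketch, which assumes the values printed in Tables 5 through 8. The equal-cost baseline at the end is our own addition for illustration and is not part of the original test.

```python
# Forecasts and actual demands are held constant across Tables 5 through 8.
forecasts = [90, 30, 50, 80]
actuals = [100, 100, 100, 100]

# Unit costs per item for each data set, copied from Tables 5 through 8.
cost_sets = {
    "Data Set 1": [1000.00, 50.00, 20.00, 250.00],
    "Data Set 2": [50.00, 1000.00, 20.00, 250.00],
    "Data Set 3": [20.00, 50.00, 1000.00, 250.00],
    "Data Set 4": [250.00, 50.00, 20.00, 1000.00],
}


def cimip(forecasts, actuals, unit_costs):
    """One minus total dollar error over total dollar demand, in percent."""
    dollar_error = sum(c * abs(f - a) for f, a, c in zip(forecasts, actuals, unit_costs))
    dollar_demand = sum(c * a for a, c in zip(actuals, unit_costs))
    return 100.0 * (1.0 - dollar_error / dollar_demand)


results = {name: cimip(forecasts, actuals, costs) for name, costs in cost_sets.items()}
for name, value in results.items():
    print(f"{name}: CIMIPf = {value:.0f}%")

# Illustrative baseline (not in the source): every item given the same unit cost.
equal_cost_result = cimip(forecasts, actuals, [1.0, 1.0, 1.0, 1.0])
print(f"Equal unit costs: CIMIPf = {equal_cost_result:.1f}%")
```

Because the most expensive item dominates both sums, the result moves toward that single item's accuracy: high when the $1,000 item is the well-forecast item 1, low when it is the poorly forecast item 2.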

3. Production of Intuitive Results

CIMIPf uses two features commonly found in percentage equations. It first applies the complementary concept of “one minus the fraction”, then it multiplies that fractional value by 100 to produce a percentage result.

However, percentage equations are expected to produce values between zero and one, which does not always occur in CIMIPf. The summation of errors, CIMIPf's numerator, can be higher than the summation of actual demands, CIMIPf's denominator. That condition causes the fraction to be bigger than one and the final number to be negative and unbounded, which we consider counter-intuitive.

To demonstrate that, we built a test comprised of two hypothetical items, as follows:

Table 9. Generation of Counter-Intuitive Results - Initial Data Set

             f_i   a_i   c_i   |f_i - a_i|   c_i * a_i   c_i * |f_i - a_i|
Test item    1     1     1     0             1           0
Fixed item   1     1     1     0             1           0
Sum                                          2           0
CIMIPf                                                   100%

By allowing the forecast value of the test item to vary from one to 10, we obtained the results shown in Figure 6.

Figure 6. Generation of Counter-Intuitive Results by CIMIPf


Counter-intuitive, negative results are generated by CIMIPf in cases where the summation of errors is larger than the summation of actual demands. Considering the results at the item level, we infer that products with errors larger than actual demand may exert significant negative pressure on the aggregated CIMIPf result.
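The sweep behind Figure 6 is simple enough to recompute. The sketch below assumes the two-item data set of Table 9 (actual demand and unit cost of one for both items) and varies only the forecast of the test item; the variable names are ours.

```python
def cimip(forecasts, actuals, unit_costs):
    """One minus total dollar error over total dollar demand, in percent."""
    dollar_error = sum(c * abs(f - a) for f, a, c in zip(forecasts, actuals, unit_costs))
    dollar_demand = sum(c * a for a, c in zip(actuals, unit_costs))
    return 100.0 * (1.0 - dollar_error / dollar_demand)


actuals = [1, 1]      # test item, fixed item (Table 9)
unit_costs = [1, 1]   # both items cost one dollar

# Vary the forecast of the test item from 1 to 10; the fixed item stays at 1.
results = {}
for test_forecast in range(1, 11):
    results[test_forecast] = cimip([test_forecast, 1], actuals, unit_costs)

for test_forecast, value in results.items():
    print(f"forecast = {test_forecast:2d}  CIMIPf = {value:7.1f}%")
```

With a total demand of two units, the result reaches zero once the test item is over-forecast by two units and keeps falling by 50 percentage points for every further unit, with no lower bound.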

Furthermore, under-estimation errors are bounded by the actual demand, because a forecast cannot fall below zero, so all forecast errors larger than actual demand occur with over-estimations. That inherent characteristic of forecast errors helps all accuracy metrics to penalize the occurrence of extremely large over-estimations, which are closely related to the formation of excess inventory.

4. Composition of Data Matters

If the probability of errors that are bigger than actual demand is assumed to change along with demand size, then the composition of the data may affect CIMIPf results. One can intuitively assume that low-demand items are more likely to have errors bigger than their actual demands. Considering that, if a data set is primarily comprised of low-demand items, a poor, or even negative, CIMIPf result is to be expected.

To validate the rationale that composition of data matters, we first need to test the assumption that errors bigger than demand are more frequent in low-demand items. According to FY15 data, among 44,675 NIINs, 24,309 (54.41%) had errors bigger than demand, and they were distributed according to the following histogram:

Figure 7. Histogram of Items with Errors Bigger than Demand in FY15

[Histogram not reproduced; horizontal axis: FY15 Demands.]

The vertical axis is on an exponential scale, which helps to picture the extreme skewness of the data.

Second, we divided the data into low-demand and high-demand items, according to a quantile approach, to compare CIMIPf results.

Table 10. CIMIPf Results on Low-Demand Versus High-Demand Items (FY15)

              Dmd size   Items    Dollar error        Dollar dmd          CIMIPf
Low-demand    0-1        28,235   $555,585,938.00     $125,993,049.00     -341%
High-demand   2-inf      15,690   $2,022,307,097.00   $4,863,140,807.00   58%
Aggregate     0-inf      43,925   $2,577,893,035.00   $4,989,133,856.00   48%

There is clear evidence that low-demand items can exert a negative pressure on the overall result.
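The segment results in Table 10 follow directly from the dollar totals, as the sketch below shows. It assumes only the totals printed in the table; the helper name is ours.

```python
# (dollar error, dollar demand) per segment, copied from Table 10 (FY15).
segments = {
    "Low-demand (0-1)": (555_585_938.00, 125_993_049.00),
    "High-demand (2-inf)": (2_022_307_097.00, 4_863_140_807.00),
}


def cimip_from_totals(dollar_error, dollar_demand):
    """CIMIPf in percent, computed from already-summed dollar totals."""
    return 100.0 * (1.0 - dollar_error / dollar_demand)


per_segment = {name: cimip_from_totals(err, dmd) for name, (err, dmd) in segments.items()}

# The aggregate is NOT the average of the segment results: the dollar totals
# are pooled first, so the segment with more dollar demand dominates.
total_error = sum(err for err, _ in segments.values())
total_demand = sum(dmd for _, dmd in segments.values())
aggregate = cimip_from_totals(total_error, total_demand)

for name, value in per_segment.items():
    print(f"{name}: {value:.0f}%")
print(f"Aggregate: {aggregate:.2f}%")
```

The low-demand segment holds most of the items but only a small share of the dollar demand, which is why its strongly negative result lowers the aggregate without dominating it.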

Additionally, Table 11 shows that CIMIPf results tend to improve as we restrict the data to items with higher demand. The aggregate CIMIPf, 48%, disguises the fact that for high-demand items the dollar error is relatively small, while for low-demand, intermittent items the dollar error relative to the actual dollar demand is very large.


Table 11. Data Composition and CIMIPf Variation (FY15)

Dmd size   Qty of items   Dollar error        Dollar dmd          CIMIPf
0-inf      43,925         $2,577,893,034.49   $4,989,133,856.14   48.33%
100-inf    546            $294,122,953.00     $941,957,920.00     68.78%
500-inf    90             $22,186,297.00      $79,316,220.00      72.03%
1000-inf   49             $13,291,976.00      $48,555,715.00      72.63%

We partially attribute those increasing CIMIPf results to the fact that errors bigger than demand become less likely as demand increases. On top of that, items with higher demand usually display a pattern that facilitates the generation of accurate forecasts.

Therefore, combining the results of the three tests conducted in this section, we infer that the composition of the data set, expressed as a ratio of high-demand to low-demand items, can significantly affect CIMIPf results. The higher the ratio of low-demand to high-demand items, the more likely the result will be a lower forecast accuracy measurement.

C.  COMPARATIVE  ANALYSIS 

Considering the potential flaws of CIMIPf mentioned above, a comparative analysis is necessary to allow a judgment about the existence of a better metric. After reviewing the existing literature, we selected an alternative metric and developed a framework to allow a fair comparison between the two metrics.


1.  Alternative  Metric  Selection 

As  discussed  in  the  literature  review,  MASE  is  intuitively  expected  to  gather  most 
of  the  desirable  characteristics  of  a  forecast  accuracy  measure,  thus  justifying  its  use  as 
an  alternate  metric  for  comparison.  Specifically,  one  of  the  main  characteristics  of  MASE 
is  the  capacity  to  produce  accuracy  results  at  the  item  level,  even  when  actual  demand  is 
zero,  as  well  as  at  the  aggregate  level.  Another  important  characteristic  is  that  it  enables  a 
fair  comparison  among  the  services  and  DLA  through  its  use  of  a  benchmark  method 
instead  of  generating  absolute  values. 

a.  Further  Discussion  on  Performance  Benchmarking 

According to Dictionary.com, a benchmark is “any standard or reference by which others can be judged”, and the practice of using a benchmark to measure performance is widespread. An additional definition of the word is “a standard of excellence, achievement, etc., against which similar things must be measured or judged” (Dictionary.com), and this idea of comparing similar things is key. Most people have heard a version of the phrase “comparing apples and oranges”, and it applies to many areas where comparisons are made between two or more things. In our research we have discussed how DOD intends to measure the forecasting performance of the military services and DLA by calculating how well each of them generated forecasts for the material that they manage. While this exercise in measurement and comparison is intended to complement the goals of the overall CIMIP, it does not mean that we are making a true “apples to apples” comparison.

CIMIPf is simply computed by inserting forecasted demand, actual demand and unit cost for each item into the equation, which then produces one number. Although each service and DLA is engaged in managing secondary inventory, the material, quantity and demand patterns of this inventory are not the same. While they may appear similar and in some ways are, the fact is that they each face unique challenges in forecasting their demand and it is potentially misleading to directly compare their performance. To illustrate this point with something that all federal employees are familiar with, we will examine the use of benchmarks by the Thrift Savings Plan (TSP).

On April 1, 1987, the TSP began operations with a single fund, known as the G Fund, which invested solely in government securities that were not available to the public. By 2001, the number of investment funds available in the TSP had grown to five with the inclusion of the fixed income F Fund, the common stock C Fund, the small capitalization stock S Fund and the international stock I Fund. Following common industry practice, since each of these four new funds was invested in securities available to the public, each fund's performance is compared against a commercial index made up of similar assets. These commercial indexes act as performance benchmarks for the funds. Since the TSP funds are modeled after these commercial indexes, a strategy known as passive management, their performance does not vary much from the index. This common industry practice becomes more important with actively managed funds, where managers are attempting to outperform these commercial indexes. Table 12 shows each TSP fund with its respective index or benchmark, and Table 13 compares the performance of the TSP funds against their benchmark indexes.


Table 12. TSP Fund and Benchmark Index. Adapted from Thrift Savings Plan (n.d.b).

TSP Fund   Commercial Benchmark
G Fund     N/A
F Fund     Barclays Capital U.S. Aggregate Bond Index
C Fund     Standard & Poor's 500 Stock Index
S Fund     Dow Jones U.S. Completion TSM Index
I Fund     MSCI EAFE Stock Index

Table 13. TSP and Index Annual Returns 2011-2015. Source: Thrift Savings Plan (n.d.a).

                        U.S. Agg.             S&P 500            DJ U.S. Completion             EAFE
Year   G Fund   F Fund  Bond Index   C Fund   Index     S Fund   TSM Index            I Fund    Index
2011   2.45%    7.89%   7.84%        2.11%    2.11%     -3.38%   -3.76%               -11.81%   -12.14%
2012   1.47%    4.29%   4.22%        16.07%   16.00%    18.57%   17.89%               18.62%    17.32%
2013   1.89%    -1.68%  -2.03%       32.45%   32.39%    38.35%   38.05%               22.13%    22.78%
2014   2.31%    6.73%   5.97%        13.78%   13.69%    7.80%    7.63%                -5.27%    -4.90%
2015   2.04%    0.91%   0.55%        1.46%    1.38%     -2.92%   -3.42%               -0.51%    -0.81%

This table demonstrates how an individual TSP fund's performance compares to a benchmark index, rather than a simple comparison to the other TSP funds.


The comparison to these benchmark index funds enables managers and potential investors to better judge the effectiveness of the TSP fund managers in meeting their intended objective. For example, an S Fund investor should be satisfied with the management of the fund for all five years even though the C Fund had better returns in three of the five years. An apples-to-oranges comparison of the S and C Funds over these five years would conclude that the S Fund manager performed better in only two of the five years, while the C Fund manager performed better in three of the five years. An apples-to-apples comparison of these two fund managers would conclude that both of them matched or exceeded the performance of their benchmark index in all five years.

b.  DOD  Forecasting  Benchmarks 

The same principle of comparing investment fund performance to a relevant benchmark applies to the comparison of the services and DLA in their year-to-year forecasting performance. Concluding that one service forecasted better than another, based on a single CIMIPf metric result, ignores the fact that the lower-performing service may be managing material that is much more challenging to forecast than the higher-performing service. To date the DOD has resisted GAO recommendations to set standard forecasting performance goals, which could potentially result in apples-to-oranges comparisons. The DOD has stated that it wanted “to establish a baseline of performance on the metrics prior to setting any department-wide goals” (GAO, 2015b, p. 43), yet a department-wide goal, while simple, may not be as effective in measuring true forecast performance. An alternate method would be for each service to generate forecast accuracy metrics for a naive method forecast of their material and compare that with their actual performance. In keeping with our investment fund analogies, this method of evaluation is similar to how actively managed investment portfolios are compared against an index of similar assets.

The  calculation  of  a  naive  method  simply  requires  the  user  to  determine  the  level 
of  demand  for  the  preceding  period  and  then  assume  that  the  demand  will  remain  the 
same  in  the  future  period. 

To illustrate how the naive method functions as a benchmark, Table 14 presents a set of three hypothetical items with different levels of demand variability, which are visualized in Figure 8, along with their accuracy results, measured by four different metrics.

Table 14. Naive Method as a Benchmark

Item 1

Time   Demand   Naive   Err   Abs err   Sq err   APE
1      10                                        0%
2      9        10      -1    1         1        11%
3      11       9       2     2         4        18%
4      10       11      -1    1         1        10%
5      9        10      -1    1         1        11%
6      11       9       2     2         4        18%
7      11       11      0     0         0        0%
8      10       11      -1    1         1        10%
9      9        10      -1    1         1        11%
10     10       9       1     1         1        10%

Stdev   0.8164966      MAE     1.11
Avg     10             MSE     1.56
CV      0.0816497      CIMIP   90%
                       MAPE    10%

Item 2

Time   Demand   Naive   Err   Abs err   Sq err   APE
1      10                                        0%
2      6        10      -4    4         16       67%
3      14       6       8     8         64       57%
4      10       14      -4    4         16       40%
5      6        10      -4    4         16       67%
6      14       6       8     8         64       57%
7      14       14      0     0         0        0%
8      10       14      -4    4         16       40%
9      6        10      -4    4         16       67%
10     10       6       4     4         16       40%

Stdev   3.2659863      MAE     4.44
Avg     10             MSE     24.89
CV      0.3265986      CIMIP   60%
                       MAPE    43%

Item 3

Time   Demand   Naive   Err   Abs err   Sq err   APE
1      10                                        0%
2      3        10      -7    7         49       233%
3      17       3       14    14        196      82%
4      10       17      -7    7         49       70%
5      3        10      -7    7         49       233%
6      17       3       14    14        196      82%
7      17       17      0     0         0        0%
8      10       17      -7    7         49       70%
9      3        10      -7    7         49       233%
10     10       3       7     7         49       70%

Stdev   5.7154761      MAE     7.78
Avg     10             MSE     76.22
CV      0.5715476      CIMIP   30%
                       MAPE    107%

All four accuracy results for the naive forecasts are best for item 1, which has the smallest coefficient of variation (CV) in the data set. The opposite also holds: the worst accuracy results in all metrics were obtained for the item that has the highest coefficient of variation. Since this analysis is at the item level, we applied the item-level form of CIMIPf, Equation (3.3).

Figure 8. Different Levels of Variability

[Figure not reproduced.]

The items' demands were designed to provide a clear picture of the different levels of variability.

According to the example, with the naive method, material with a lower level of variability generates relatively accurate forecasts, while material with a higher level of variability generates relatively poor forecasts.

Aggregating the individual naive accuracy results across a large set of items should give the user a general idea of how difficult the population of material is to forecast. A large error signifies a difficult population, while a small error signifies a simple population.
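As a minimal sketch of the naive benchmark, the code below recomputes the MAE, MSE and CV rows of Table 14 from the three demand series. It assumes the series printed in the table and uses the sample standard deviation, which matches the Stdev row. The CIMIP and MAPE rows are left out because reproducing them depends on how the first period (which has no naive forecast) is counted, and the text does not state that explicitly.

```python
import statistics

# Demand series copied from Table 14 (periods 1 through 10).
demand_by_item = {
    "Item 1": [10, 9, 11, 10, 9, 11, 11, 10, 9, 10],
    "Item 2": [10, 6, 14, 10, 6, 14, 14, 10, 6, 10],
    "Item 3": [10, 3, 17, 10, 3, 17, 17, 10, 3, 10],
}


def naive_benchmark(demand):
    """Score the naive forecast (previous period's demand) on one series."""
    forecasts = demand[:-1]   # the forecast for period t is the demand at t-1
    actuals = demand[1:]      # only periods 2..n have a naive forecast
    abs_errors = [abs(a - f) for f, a in zip(forecasts, actuals)]
    mae = sum(abs_errors) / len(abs_errors)
    mse = sum(e * e for e in abs_errors) / len(abs_errors)
    cv = statistics.stdev(demand) / statistics.mean(demand)
    return {"MAE": mae, "MSE": mse, "CV": cv}


summary = {item: naive_benchmark(series) for item, series in demand_by_item.items()}
for item, scores in summary.items():
    print(f"{item}: MAE={scores['MAE']:.2f}  MSE={scores['MSE']:.2f}  CV={scores['CV']:.4f}")
```

All three series share the same mean, so the growth in naive error from Item 1 to Item 3 is driven purely by variability, which is the property that makes the naive result usable as a difficulty benchmark.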

In the same manner that investors expect their asset managers to provide value greater than a passively managed investment, so too should the DOD expect its material managers to generate forecasts that generally perform better than a naive method benchmark. Table 15 and Figure 9 demonstrate how utilizing a naive benchmark like this would give DOD leadership a better understanding of how well its components were actually forecasting. While the Navy is more interested in improving its own forecasting efforts, the DOD needs to be able to accurately assess the performance of all five reporting agencies.

Table 15. Theoretical Forecast and Benchmark Performance

          Army       Army        Navy       Navy        Air Force   Air Force   DLA        DLA
          Forecast   Naive       Forecast   Naive       Forecast    Naive       Forecast   Naive
Year      Accuracy   Benchmark   Accuracy   Benchmark   Accuracy    Benchmark   Accuracy   Benchmark
2011      30%        40%         45%        40%         55%         60%         90%        85%
2012      40%        35%         55%        45%         60%         75%         80%        90%
2013      10%        30%         49%        42%         64%         55%         70%        80%
2014      32%        25%         50%        40%         62%         65%         75%        85%
2015      40%        30%         48%        45%         70%         70%         80%        90%
Average   30%        32%         49%        42%         62%         65%         79%        86%

Numbers are fictional. This table demonstrates how naive method benchmarks can bring forecast accuracy results into perspective, in a similar way that TSP fund performance is compared to a benchmark index.


Figure 9. Theoretical Chart Comparing Navy Versus DLA Forecasting Efforts (Numbers are Fictional)

[Bar chart not reproduced; vertical axis: 0% to 100%; horizontal axis: 2011 through 2015; series: Navy Forecast Accuracy, Navy Naive Benchmark, DLA Forecast Accuracy, DLA Naive Benchmark.]

This figure intends to demonstrate that if a manager considered forecast accuracy in isolation then they would conclude that DLA was outperforming the Navy, but if the manager was provided with benchmarks then they may reach the opposite conclusion.
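The comparison that Figure 9 is meant to convey can be sketched as a simple margin over the benchmark. The code assumes the fictional Navy and DLA values of Table 15; the "margin" measure (accuracy minus naive benchmark, in percentage points) is our own shorthand for illustration and is not a metric defined in the source.

```python
# Fictional yearly values (percent) copied from Table 15, 2011 through 2015.
years = [2011, 2012, 2013, 2014, 2015]
navy_accuracy = [45, 55, 49, 50, 48]
navy_naive = [40, 45, 42, 40, 45]
dla_accuracy = [90, 80, 70, 75, 80]
dla_naive = [85, 90, 80, 85, 90]


def margin_over_benchmark(accuracy, benchmark):
    """Percentage points by which the forecast beats (or trails) the naive benchmark."""
    return [a - b for a, b in zip(accuracy, benchmark)]


navy_margin = margin_over_benchmark(navy_accuracy, navy_naive)
dla_margin = margin_over_benchmark(dla_accuracy, dla_naive)

for year, navy, dla in zip(years, navy_margin, dla_margin):
    print(f"{year}: Navy {navy:+d} pts, DLA {dla:+d} pts")

navy_average = sum(navy_margin) / len(navy_margin)
dla_average = sum(dla_margin) / len(dla_margin)
print(f"Average margin: Navy {navy_average:+.1f} pts, DLA {dla_average:+.1f} pts")
```

In isolation DLA's accuracy is higher every year, yet in this fictional data DLA trails its own naive benchmark in four of five years while the Navy beats its benchmark in all five.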


2. Tests of Desirable Characteristics

We selected four characteristics regarded as relevant to any reliable forecast accuracy metric, as follows: sensitivity to volume heterogeneity, symmetry on error treatment, robustness at individual and aggregated levels, and allowance for a fair comparison.

To provide a means of comparison between accuracy metrics, we designed a particular test for each of the desirable characteristics. At the end of this section, we gathered the results in a judgment table to identify the best metric.

a.  Sensitivity  to  Volume  Heterogeneity 

Assuming all items are of equal value, a pure aggregated forecast accuracy metric must give equal importance to each item. Otherwise, if any kind of weight is applied to specific items, results can be seriously distorted. Since the impact of unit cost variation on CIMIPf has already been tested in this research, we still need to test whether its results are potentially dominated by large forecasts and actual demands. It is obvious that different items contribute different amounts to the overall CIMIPf. But, since the item weight is composed of the demand volume and the unit cost, the degree to which high-volume items contribute disproportionately in any given dataset is an empirical question (again, assuming equal proportionality is what is desired). In this section, we test the relative sensitivity of CIMIPf and MASE to volume heterogeneity across inventory items.

We built a test, comprised of two fictional datasets per accuracy metric, to check for the possibility of type I errors (saying the forecast is accurate when it is actually inaccurate) and type II errors (saying the forecast is inaccurate when it is actually accurate).

The first data set was designed to reflect a situation in which the forecast value is very close to the actual demand for one high-volume item, but the forecast model performs poorly for nine other low-volume items. In that situation, we should expect the CIMIPf result to indicate that the aggregated accuracy is low, and thus that the forecast method is performing poorly. Otherwise, a type I error arises.

Table 16. High Volume and Type I Errors - CIMIPf

Items   f_i    a_i     c_i         |f_i - a_i|   c_i * a_i        c_i * |f_i - a_i|
1       9000   10000   $1,000.00   1000          $10,000,000.00   $1,000,000.00
2       5      10      $1,000.00   5             $10,000.00       $5,000.00
3       5      10      $1,000.00   5             $10,000.00       $5,000.00
4       5      10      $1,000.00   5             $10,000.00       $5,000.00
5       5      10      $1,000.00   5             $10,000.00       $5,000.00
6       5      10      $1,000.00   5             $10,000.00       $5,000.00
7       5      10      $1,000.00   5             $10,000.00       $5,000.00
8       5      10      $1,000.00   5             $10,000.00       $5,000.00
9       5      10      $1,000.00   5             $10,000.00       $5,000.00
10      5      10      $1,000.00   5             $10,000.00       $5,000.00
Sum                                              $10,090,000.00   $1,045,000.00
CIMIPf                                                            89.64%

Since there is currently no DOD threshold for what constitutes an accurate forecast, we assume that CIMIPf > 80% classifies the forecast as accurate. The result for this data set is not aligned with the initial expectation of poor performance. Therefore, we state that the result led to a type I error.

The second data set aims to represent the opposite situation: a high-volume item has a poor forecast, while nine low-volume items have good forecasts. In that situation, we should expect the CIMIPf result to indicate good forecast accuracy. Otherwise, a type II error is considered to occur.


Table 17. High Volume and Type II Errors - CIMIPf

Items   f_i    a_i     c_i         |f_i - a_i|   c_i * a_i        c_i * |f_i - a_i|
1       5000   10000   $1,000.00   5000          $10,000,000.00   $5,000,000.00
2       9      10      $1,000.00   1             $10,000.00       $1,000.00
3       9      10      $1,000.00   1             $10,000.00       $1,000.00
4       9      10      $1,000.00   1             $10,000.00       $1,000.00
5       9      10      $1,000.00   1             $10,000.00       $1,000.00
6       9      10      $1,000.00   1             $10,000.00       $1,000.00
7       9      10      $1,000.00   1             $10,000.00       $1,000.00
8       9      10      $1,000.00   1             $10,000.00       $1,000.00
9       9      10      $1,000.00   1             $10,000.00       $1,000.00
10      9      10      $1,000.00   1             $10,000.00       $1,000.00
Sum                                              $10,090,000.00   $5,009,000.00
CIMIPf                                                            50.36%


Using the same threshold of CIMIPf > 80% to classify a forecast as accurate, the result for this data set is also not aligned with the initial expectation of good performance. Therefore, we state that the result led to a type II error.
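Both CIMIPf volume tests can be replayed with the sketch below. It assumes the item values of Table 16 and Table 17 and the 80% threshold adopted in the text; the constant and function names are ours.

```python
UNIT_COST = 1000.00          # every item in Tables 16 and 17 costs $1,000
ACCURATE_THRESHOLD = 80.0    # assumed in the text: CIMIPf > 80% counts as accurate

# (forecast, actual demand) per item: one high-volume item plus nine low-volume items.
type_i_items = [(9000, 10000)] + [(5, 10)] * 9     # Table 16
type_ii_items = [(5000, 10000)] + [(9, 10)] * 9    # Table 17


def cimip(items, unit_cost):
    """One minus total dollar error over total dollar demand, in percent."""
    dollar_error = sum(unit_cost * abs(f - a) for f, a in items)
    dollar_demand = sum(unit_cost * a for _, a in items)
    return 100.0 * (1.0 - dollar_error / dollar_demand)


type_i_result = cimip(type_i_items, UNIT_COST)
type_ii_result = cimip(type_ii_items, UNIT_COST)

print(f"Table 16: CIMIPf = {type_i_result:.2f}%  accurate? {type_i_result > ACCURATE_THRESHOLD}")
print(f"Table 17: CIMIPf = {type_ii_result:.2f}%  accurate? {type_ii_result > ACCURATE_THRESHOLD}")
```

In both data sets the single high-volume item supplies more than 99% of the dollar demand, so the aggregate essentially reports that one item's accuracy and ignores the other nine.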
On the other hand, the MASE metric requires a slightly different type of data to be calculated. Hence, we created a very similar test, comprised of two other data sets that reflect the same situations used to uncover the dominance of high-volume items in the CIMIPf equation. Likewise, the same error-type definitions hold.

The first test, again, is the case in which nine low-volume items have relatively high forecast errors, while one high-volume item has a relatively low forecast error. In that arrangement, we should expect the result to indicate poor performance. Otherwise, we will consider a type I error to exist.


Table  18.  High  Volume  and  Type  I  Errors  -  MASE 


Items   fi      ai       MAE of in-sample naive   |et|    qt
1       9000    10000    2500.00                  1000    0.40
2       5       10       2.50                     5       2.00
3       5       10       2.50                     5       2.00
4       5       10       2.50                     5       2.00
5       5       10       2.50                     5       2.00
6       5       10       2.50                     5       2.00
7       5       10       2.50                     5       2.00
8       5       10       2.50                     5       2.00
9       5       10       2.50                     5       2.00
10      5       10       2.50                     5       2.00

MASE    1.84

Assuming a threshold of MASE < 0.8 to classify an accurate forecast, a level that is clearly better than a naive forecast, the result aligns with the initial expectation. Therefore, there is no evidence of a type I error.

The second test covers the opposite situation: nine low-volume items have good forecasts and one high-volume item has a poor forecast. We should expect a good accuracy result.
Table 19. Large Numbers and Type I Errors in MASE

Items   fi      ai       MAE of in-sample naive   |et|    qt
1       5000    10000    2500.00                  5000    2.00
2       9       10       2.50                     1       0.40
3       9       10       2.50                     1       0.40
4       9       10       2.50                     1       0.40
5       9       10       2.50                     1       0.40
6       9       10       2.50                     1       0.40
7       9       10       2.50                     1       0.40
8       9       10       2.50                     1       0.40
9       9       10       2.50                     1       0.40
10      9       10       2.50                     1       0.40

MASE    0.56

Assuming the same threshold of MASE < 0.8 to classify an accurate forecast, the result aligns with the initial expectation. Therefore, we find no evidence of a type II error.

Considering the results of all four tests, it appears that CIMIPf is more sensitive to volume heterogeneity than MASE and, hence, more likely to produce misleading results because of volume heterogeneity.
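
The MASE values in Tables 18 and 19 can be checked the same way. The sketch below assumes, as the table columns suggest, that each item's scaled error q is its absolute forecast error divided by the MAE of the in-sample naive forecast, and that the aggregate MASE is the simple mean of q across items. The function name `mase` is ours.

```python
def mase(abs_errors, naive_maes):
    """Mean of scaled errors q = |e| / (in-sample naive MAE), across items."""
    scaled = [e / m for e, m in zip(abs_errors, naive_maes)]
    return sum(scaled) / len(scaled)


# In both tables, item 1 is the high-volume item.
naive_maes = [2500.0] + [2.5] * 9

table18 = mase([1000] + [5] * 9, naive_maes)   # nine poor low-volume forecasts
table19 = mase([5000] + [1] * 9, naive_maes)   # nine good low-volume forecasts

print(f"Table 18 MASE = {table18:.2f}")
print(f"Table 19 MASE = {table19:.2f}")
```

Because every item's error is scaled before averaging, the high-volume item counts as one item among ten instead of dominating the result.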

b.  Symmetry  on  Error  Treatment 

As mentioned before, forecast errors in inventory demand data are bounded on the negative side, as a result of underestimations, and unbounded on the positive side, as a result of overestimations. However, forecast methods are expected to generate reasonable errors for the majority of items. Hence, we designed this test to verify whether equivalent variations of actual demand values, within a moderate range, toward the positive and negative sides have a different impact on CIMIPf than on MASE. Table 20 shows the initial arrangement of the test.




Table  20.  Initial  Dataset  to  Test  Error  Side  Equality  -  CIMIPf 

Items   fi     ai     Ci     |fi-ai|   Ci*ai         Ci*|fi-ai|
1       100    100    100    0         $10,000.00    $-
2       100    100    100    0         $10,000.00    $-
3       100    100    100    0         $10,000.00    $-
4       100    100    100    0         $10,000.00    $-
Sum                                    $40,000.00    $-

CIMIPf  100%

Decision variables (uniform distributions):

Variable   Minimum   Maximum
A1         0.00      50.00
A2         50.00     100.00
A3         100.00    150.00
A4         150.00    200.00

Considering that the ranges of variation were designed to cause an equal proportion of positive and negative errors, the intuitive result would be that the items with the biggest errors on both sides contribute most to CIMIPf variations.

However, according to Figure 10, overestimations seem to impose heavier pressure on CIMIPf results than underestimations do.


Figure 10. Sensitivity Chart of CIMIPf Equal Treatment Test
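
The asymmetry can be illustrated with a small Monte Carlo sketch. It is not the simulation tool used to produce Figure 10. It assumes the Table 20 setup (forecasts and unit costs fixed at 100), draws the four actual demands from the uniform ranges listed above, and uses the Pearson correlation between each draw and the resulting CIMIPf as a stand-in sensitivity measure. The trial count, seed and function names are our assumptions.

```python
import random


def cimip_f(forecasts, actuals, unit_costs):
    """Aggregate CIMIPf = 1 - sum(C*|f - a|) / sum(C*a)."""
    err = sum(c * abs(f - a) for f, a, c in zip(forecasts, actuals, unit_costs))
    dem = sum(c * a for a, c in zip(actuals, unit_costs))
    return 1 - err / dem


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(trials=5000, seed=1):
    """Correlation of each decision variable A1..A4 with CIMIPf."""
    rng = random.Random(seed)
    ranges = [(0, 50), (50, 100), (100, 150), (150, 200)]
    draws = [[] for _ in ranges]
    results = []
    for _ in range(trials):
        actuals = [rng.uniform(lo, hi) for lo, hi in ranges]
        for column, value in zip(draws, actuals):
            column.append(value)
        results.append(cimip_f([100] * 4, actuals, [100] * 4))
    return [pearson(column, results) for column in draws]


correlations = simulate()
for name, value in zip(["A1", "A2", "A3", "A4"], correlations):
    print(f"{name}: {value:+.2f}")
```

In this sketch A1 and A2 (actual demand below the forecast of 100, that is, overestimations) show markedly stronger correlations with CIMIPf than A3 and A4, because a low actual demand both raises the dollar-error and shrinks the dollar-demand in the denominator.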


The equivalent test applied to MASE is shown in Table 21.

Table 21. Initial Dataset to Test Error Side Equality - MASE

Items 1, 2, 3 and 4 (identical initial values)

               FY13   FY14   FY15
ft             100    100    100
at             100    100    100
n              50     100    100
|at - at-1|    50     0      0
et             -      -      0
qt                           0

MASE   0


Decision variables (uniform distributions):

Variable   Minimum   Maximum
A1         0.00      50.00
A2         50.00     100.00
A3         100.00    150.00
A4         150.00    200.00
Unlike what happened with CIMIPf, the sensitivity chart in Figure 11 shows that MASE gives balanced importance to errors on both sides.


Figure 11. Sensitivity Chart of MASE


c.  Robustness  at  Individual  and  Aggregate  Levels 

Acknowledging that no forecast method is expected to perform well in all situations, we agree with Fildes (1989) that individual-level analysis is more powerful for managers, as it enables them to locate the origins of inaccuracy.

CIMIPf was initially designed, and has been used, to calculate an aggregate number that represents the overall forecast performance of each service. To do so, WSS has used twelve-month windows of data to allow calculation of total dollar-errors and total dollar-demands, the two key components of the CIMIPf equation. As mentioned before in this chapter, CIMIPf is considered a robust metric at the aggregated level.
However, when the intent is to produce accuracy measures at the item level, WSS managers take out the summation signs and the unit cost from the original CIMIPf formula (E. Liskow, personal communication, April 4, 2016), resulting in the following:


$$CIMIP_i = 1 - \frac{|f_i - a_i|}{a_i} \qquad (3.2)$$


We infer that this equation suffers from the same vulnerability as MAPE, Equation (2.7), namely that it returns an undefined (infinite) value when actual demand is zero. In the specific case of the Navy's demand data, the occurrence of zero demand is highly likely, as mentioned before.


In this research, we consider robustness to be the ability to produce valid, defined results in the majority of situations, in accordance with Baker et al. (2006). Therefore, as CIMIPi returns invalid values for a significant number of items in the Navy's dataset, the metric is classified as not robust.

However, a different approach can improve the robustness of the CIMIPi equation. Rather than taking the summation sign out, the Navy could sum the forecast errors of one item over time. Unit cost is constant at the item level and is present in both summations of the fraction. Hence, it can be factored out and cancels. After applying those adjustments, the proposed equation is:

$$CIMIP_i^{*} = 1 - \frac{\sum_{t=1}^{n} |f_t - a_t|}{\sum_{t=1}^{n} a_t} \qquad (3.3)$$

That equation is vulnerable only to the specific case of all actual demands being zero during the time considered. Therefore, as the time window increases, the probability of a zero value in the denominator is expected to fall. As an example of the gain in robustness that this variation of the metric represents, when applied to a five-year, quarterly demand dataset, CIMIPi* returned valid results for 100% of items, in contrast to only 52% valid results for CIMIPi when applied to the FY15 demand dataset.
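
The difference between the two item-level variants can be sketched as follows. The code assumes the forms described above: Equation (3.2) applied to a single period, and Equation (3.3) with errors and demands summed over time. The function names and the quarterly figures are hypothetical, chosen only to show the zero-demand case.

```python
def cimip_item_period(forecast, actual):
    """Equation (3.2) for one item and one period; None if undefined."""
    if actual == 0:
        return None  # division by zero: the metric has no valid value
    return 1 - abs(forecast - actual) / actual


def cimip_item_star(forecasts, actuals):
    """Equation (3.3): errors and demands of one item summed over time."""
    total_demand = sum(actuals)
    if total_demand == 0:
        return None  # only fails when every period has zero demand
    total_error = sum(abs(f - a) for f, a in zip(forecasts, actuals))
    return 1 - total_error / total_demand


# Hypothetical intermittent item: two of four quarters have zero demand.
actuals = [0, 4, 0, 6]
forecasts = [2, 3, 2, 5]

per_period = [cimip_item_period(f, a) for f, a in zip(forecasts, actuals)]
over_time = cimip_item_star(forecasts, actuals)

print(per_period)   # two of the four periods are undefined
print(over_time)    # a single valid value for the item
```

Summing over time does not make the forecast look better; it only guarantees a defined value as long as at least one period has demand.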

MASE, on the other hand, was originally designed to be used in both dimensions of measurement, over time and across items, as used by Hyndman and Koehler (2006). Moreover, its denominator vulnerability is related to the occurrence of all-zero in-sample naive forecast errors, instead of all-zero actual demands as in CIMIPi*, which is even less likely to happen.

Therefore, we can state that MASE is potentially more robust than CIMIPi*, although its gains are not perceived in the data considered, as the latter also generated 100% valid results.

d.  Allowance  for  Fair  Comparison 

Forecast accuracy values are often used as a means of performance comparison. In that context, it is very important to set the ground for a fair comparison. Non-relative metrics do not account for the fact that different datasets may contain different amounts of variability, which create different levels of predictability and make comparison in absolute numbers unfair. Therefore, comparisons of CIMIPf results at the aggregated and individual levels tend to be distorted by the different levels of demand predictability in each dataset. MASE, conversely, uses the naive method as a benchmark to account for the level of demand predictability.

Table 22 helps to explain the difference in the interpretation of results.

Table 22. Difficulty to Forecast Test

                    CV          CIMIPi*    MASE
More Predictable    0.125494    92.23%     0.49
Less Predictable    1.937644    -0.001%    0.57

Values were calculated using data from two real items, picked as representatives of high and low coefficients of variation.

Considering a threshold of CIMIPi* > 80% to classify an accurate forecast, only the forecasts of the "more predictable" item qualify. To remain consistent, we applied a threshold of MASE < 0.80 to classify an accurate forecast. By that measure, the forecasts of both items meet the requirement.

Based on this example, we see that if the forecast metric is to be used to compare the accuracy of item forecasts (to compare IMs, for example), MASE may do a better job of controlling for the underlying variability of the data and present a better picture of the relative performance on each item (or by each IM). Of course, this is a simplification. MASE controls for only one source of variation: single-period autocorrelation. Still, the point is that not all datasets are equally predictable, and caution should be used when comparing the accuracy of organizations managing different populations of material.

Extrapolating this result to the aggregated level, we can assume a hypothetical scenario of two datasets, where one is mostly comprised of more predictable items and the other is mostly comprised of less predictable items. When accuracy is measured in absolute numbers, the results of the second dataset will most likely be worse than those of the first. MASE, by contrast, benchmarks performance against the naive method, which allows the less predictable dataset to achieve a relatively better result than the more predictable dataset.

D. CHAPTER SUMMARY

The main objectives of this chapter were to uncover evidence of inherent flaws in the CIMIPf metric, through the application of specific tests, and to draw a comparison with an alternative metric found in the literature.

The key lessons of the CIMIPf metric evaluation were:

•  Type I and Type II errors are expected to occur;

•  It can generate counterintuitive (e.g., negative) results;

•  The composition of the data set (e.g., level of variability) influences its results.

Additionally, Table 23 summarizes the results of the tests contained in the comparative analysis.

Table 23. Ranked Comparison of MASE and CIMIPf

Desirable Characteristics                        MASE   CIMIPf
Dominance of high-volume                         1      2
Error side equality                              1      2
Robustness at aggregate and individual levels    1      1*
Allow for comparability between items            1      2

This table ranks each desirable characteristic. * Grade attributed in case CIMIPi* is used.

In addition to demonstrating the theoretical problems with CIMIPf, we compared it to another metric that has been highly recommended in the literature. Our comparison was based on the numerical analysis of a set of generated examples, which are not representative, so generalization of the findings is problematic. Based on our test set, it appears that CIMIPf performs poorly relative to MASE.
IV. ANALYSES ON FORECAST PROCEDURES


A. INTRODUCTION

This chapter presents the calculations involved in building a flexible forecast model, as opposed to applying one fixed forecast method to all items. The model uses a pool of forecast methods and forecast accuracy metrics, applied at the item level, to optimize the selection of the forecast method and so reduce the expected forecast error.

B. BACKGROUND ON CURRENT NAVY'S FORECASTING PROCESS

NAVSUP is tasked with managing over 350,000 line items (E. Liskow, personal communication, April 4, 2016) as they progress through six LCI categories. LCIs 1 and 2 cover the period from initial operational capability to the material support date, when there is little to no historical demand data, while LCI 3 occurs during the demand development interval. LCIs 4 and 5 cover the periods when the weapon system program is mature and when it has been identified for retirement, while LCI 6 covers the period after the official retirement. The way the Navy forecasts demand differs across these LCIs, yet in this paper we focus only on the forecasting procedures for LCIs 4 and 5. Currently, LCI 4 consists of approximately 284,000 line items and LCI 5 consists of approximately 23,000 line items (E. Liskow, personal communication, April 4, 2016); yet only about 40,000 of these line items generate actual demand in a given year and meet the CIMIP definition of a forecastable item. The Navy utilizes a customized Enterprise Resource Planning (ERP) program to generate forecasts for all LCI 4 and 5 line items, yet not all of these forecasts factor into the CIMIP forecast accuracy metrics.

In a broad sense, the forecasting process begins by segregating the global wholesale demand for the previous five years into 20 quarterly buckets. It is important to note that this wholesale demand is not the retail, or end unit, demand, but rather the replenishment purchases made by the purchasing agents at the wholesale level. With these 20 quarters of historical demand calculated for all LCI 4 and 5 line items, ERP runs an exponential smoothing with backcasting algorithm, utilizing a smoothing factor, or alpha ($\alpha$), equal to 0.2. From these calculations ERP generates a constant quarterly forecast for the next five years. Since the forecasted demand is constant, it is sufficient to multiply one quarter by four to generate the annual forecasts for the next five years. This forecasting process is repeated every quarter in an attempt to capture demand changes in the items with higher variability. The forecasts generated by ERP are also subject to review by the responsible IM, who has the option to modify them as deemed appropriate. Upon completion of the IM review, the demand forecast is finalized and published for use in purchasing and other material management decisions.

The Office of the Secretary of Defense for Supply Chain Integration requires that each component report its forecast metrics semi-annually at the inventory management review. In April and October, NAVSUP generates the Navy's official CIMIP accuracy and bias metrics by comparing the original forecast for the preceding 12 months with the actual demand during that period. Since the beginning of CIMIP metric reporting in FY13, NAVSUP has attempted to improve its forecasting results by correcting erroneous data and identifying the specific line items with the most significant forecasting errors (E. Liskow, personal communication, April 4, 2016). While current capabilities have made it necessary to utilize a one-size-fits-all forecasting model, NAVSUP plans to enhance its ability to generate tailored forecasts for those items for which the one-size-fits-all forecasting method produces inferior results (E. Liskow, personal communication, April 4, 2016).

C.  OBJECTIVE  OF  THE  MODEL 

The  mathematical  model  applied  in  this  chapter  aims  to  fill  the  existing  gap 
between  the  current  forecast  process  that  uses  a  fixed  method  with  fixed  parameters  and 
the  desired  stage  of  a  tailored  solution.  The  limitation  of  the  model  is  that  we  arbitrarily 
chose  the  parameters  to  initiate  the  calculations,  instead  of  using  computational  tools  to 
optimize  the  choice. 
As mentioned before in this research, DOD requests the generation of an accuracy number capable of representing the overall performance of the components in forecasting item demand in a given fiscal year. Those measures, combined with a certain threshold, aim to induce improvements in the components' forecasting processes, which is expected to help in the effort to reduce excess inventory.

Additionally, we consider that the Navy's forecasters can use the accuracy measures to improve their work. The idea is to use those measures to identify relevant deviations and to help decide on the most effective way to generate the forecasts. Hence, from the perspective of forecasters, the information needed is slightly different. Rather than generating a number that represents the overall ability to produce accurate forecasts in a given period, the new approach should measure an item's accuracy over time.

We also acknowledge that there is no absolute best forecast method capable of generating the most accurate values for every line item. Therefore, we designed a test to determine whether there are particular patterns of demand in which specific forecast methods tend to outperform the others. Moreover, we intend to present a decision-making aid for a forecaster who is dealing with an extensive and heterogeneous set of item demands.

D. MODEL DESIGN

In order to generate the required information, we built a flexible forecast model, which selects each individual item, generates forecast values using a pool of forecast methods, and measures accuracy in a particular way to identify the forecast method that minimizes the forecast error. Once the whole dataset is trimmed, a cycle of events takes place to generate the intended information. Figure 12 shows the sequence of tasks involved in the model.

Figure 12. Model's Flow Chart

The following sections describe the relevant tasks of the model.


1. Trim the Data

The original data set used to initiate the model comprises five years of past demand for 80,427 NIINs. Demand data is grouped into 20 bins, each one representing a quarter of a fiscal year. In order to allow the calculation of six different forecast methods and four different accuracy metrics, the items that did not meet the minimum requirements were withdrawn.

One limitation of the model used in this analysis is that one of the forecast methods and one of the accuracy metrics are not able to generate valid results in all situations. In order to avoid invalid (infinite) results, the data set has to be trimmed to comprise only items that fulfill two conditions: variable demand in the first four periods and at least one demand greater than zero in the last eight periods. After applying those conditions, 30,472 items remained out of a total dataset of 80,427 items.
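
The two trimming conditions can be expressed as a simple filter. This is a sketch under our own reading of the text: "variable demand in the first four periods" is taken to mean that the first four quarterly values are not all equal. The function name and the three sample series are hypothetical.

```python
def keep_item(demand):
    """Return True if a 20-quarter demand series passes both conditions."""
    first_four = demand[:4]
    last_eight = demand[-8:]
    # Assumed reading: demand "varies" if the first four values are not all equal.
    varies_early = len(set(first_four)) > 1
    has_late_demand = any(quantity > 0 for quantity in last_eight)
    return varies_early and has_late_demand


# Hypothetical series of 20 quarterly demands each.
items = {
    "A": [0, 0, 0, 0] + [1] * 16,          # no variation in first four periods
    "B": [1, 2, 0, 3] + [0] * 16,          # no demand in last eight periods
    "C": [1, 2, 0, 3] + [0] * 15 + [4],    # passes both conditions
}

kept = [name for name, demand in items.items() if keep_item(demand)]
print(kept)
```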
2. Separate Fit and Test Periods

Following the procedure in Makridakis et al. (1998), each item's demand is broken into two pieces. The first is called the fit period and the second is called the test period. The fit period corresponds to the first 12 periods and is used to initiate the forecast methods. The test period is formed by the demand in the subsequent eight periods and is used to test the difference between the forecast generated and the actual demand. Figure 13 presents the demand and the two periods of a sample item in visual form.


Figure  13.  Fit  and  Test  Periods 
The blue curve shows the demand of item NIIN 01-464-6078. The dashed red line is the break point between the fit and test periods. Forecasts are generated for periods 13 to 20 to allow comparison with the actual demand.


3. Calculate Forecasts

Makridakis  et  al.  (1998)  define  three  categories  of  forecasts:  quantitative, 
qualitative  and  unpredictable.  All  quantitative  methods  assume  that  the  identified  pattern 
of  past  demand  is  expected  to  hold  in  the  future.  Additionally,  time  series  is  the  name  of  a 
family  of  forecast  methods  existing  in  the  quantitative  category. 

Considering  the  fact  that  no  item  in  LCI  4  and  5  is  expected  to  generate  demand 
shifts,  trends  or  seasonality,  we  assumed  that  the  demand  pattern  is  stationary.  Hence,  our 
forecasting  model  comprises  six  of  the  simplest  time  series  forecast  methods  found  in 
literature. 
We selected two averaging methods, two exponential smoothing methods, a combination of methods, and the one that the Navy is currently using. We used the same taxonomy as Makridakis et al. (1998) to present the methods, as follows:

a. Simple Average (SA)

This  method  averages  all  available  demand  data,  according  to  the  following 
equation: 

$$f_{t+1} = \frac{1}{t} \sum_{i=1}^{t} a_i \qquad (3.4)$$

where: 

t  =  amount  of  available  demand  data  at  the  moment  that  the  forecast  is  generated. 

Hence, as t increases, the number of available demand points also increases, so the SA considers more data.

b. Moving Average (MA)

In contrast to SA, this method averages a fixed number of the most recent demand observations. That fixed number of observations is called the order of the average. The MA equation follows:

$$f_{t+1} = \frac{1}{k} \sum_{i=t-k+1}^{t} a_i \qquad (3.5)$$

where: 

k  =  order  of  average 

The smaller the order of the average, the more responsive the method becomes to peaks and shifts in demand. For this research, we used an MA of order 12, the exact size of the fit period, as a means to keep the method smooth.
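
The two averaging methods can be sketched in a few lines. The code assumes the usual definitions: SA averages all demand observed so far, and MA averages the k most recent observations. The function names and the five-value series are illustrative only.

```python
def simple_average(history):
    """SA: forecast for the next period is the mean of all past demand."""
    return sum(history) / len(history)


def moving_average(history, k):
    """MA of order k: mean of the k most recent demand observations."""
    if k < 1 or k > len(history):
        raise ValueError("order k must be between 1 and len(history)")
    window = history[-k:]
    return sum(window) / k


demand = [10, 12, 8, 14, 16]   # hypothetical past demand

sa_forecast = simple_average(demand)
ma_forecast = moving_average(demand, k=2)

print(sa_forecast, ma_forecast)
```

With k equal to the full length of the history, the two methods coincide, which is the situation at the first forecast of the test period when an order of 12 is used on a 12-period fit window.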

c.  Single  Exponential  Smoothing  (SES) 

In  this  method,  the  forecast  is  a  function  of  the  immediate  past  forecast,  adjusted 
by  the  last  forecast  error. 
$$f_{t+1} = f_t + \alpha (e_t) \qquad (3.6)$$

where:

$\alpha$ = smoothing factor. It is a chosen fixed value between zero and one;

$e_t$ = forecast error, Equation (2.1)

The forecast error is used to correct the past forecast value in the opposite direction when calculating the next forecast. Hence, $\alpha$ plays the important role of weighting the last forecast error. Higher values of $\alpha$ increase the impact of the last forecast error on the next forecast. As $\alpha$ increases, the method becomes more responsive, or less smooth. The opposite also holds, as lower $\alpha$ values imply a smoother method. Hence, an $\alpha$ value can be calculated to optimize the results for a specific accuracy metric. However, once the value is found, it is used as a constant over time, thus disregarding any possible change in demand pattern.

Finally, this method requires two parameters before the calculations can begin. The first is $\alpha$ and the second is the $f_1$ value, from which all the subsequent forecast values and forecast errors are generated and adjusted. Although we acknowledge the possibility of finding optimal values of the two parameters, our forecast model fixes $\alpha = 0.1$ and $f_1 = a_1$.
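
The SES recursion of Equation (3.6) can be sketched as below. Two points are our assumptions: the forecast error is taken as actual minus forecast, so that adding a fraction of it moves the forecast toward the last observation, and the first forecast defaults to the first actual demand. The function name and the sample values are hypothetical.

```python
def ses_forecasts(actuals, alpha=0.1, f1=None):
    """Return [f_1, f_2, ..., f_{n+1}] for single exponential smoothing."""
    forecast = actuals[0] if f1 is None else f1   # assumed initialization
    forecasts = [forecast]
    for actual in actuals:
        error = actual - forecast                 # assumed sign convention
        forecast = forecast + alpha * error       # Equation (3.6)
        forecasts.append(forecast)
    return forecasts


# Hypothetical demand with a step from 10 to 20; a large alpha is used
# only to make the adjustment easy to follow by hand.
step = ses_forecasts([10, 20, 20], alpha=0.5)
print(step)
```

Each forecast closes half of the remaining gap to the new demand level when alpha is 0.5; with the model's alpha of 0.1 the same adjustment would take many more periods.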

d.  Adaptive-Response-Rate  Single  Exponential  Smoothing  (ARRSES) 

This method adds the idea of a flexible $\alpha$ to the SES method. Therefore:

$$f_{t+1} = \alpha_t a_t + (1 - \alpha_t) f_t \qquad (3.7)$$

where:

$$\alpha_{t+1} = \left| \frac{A_t}{M_t} \right|$$

$$A_t = \beta e_t + (1 - \beta) A_{t-1}$$

$$M_t = \beta |e_t| + (1 - \beta) M_{t-1}$$

$\beta$ is a constant value between zero and one and relates to the degree to which $\alpha$ values are allowed to vary over time. The initialization of ARRSES requires a bigger set of fixed parameters than SES, which needs only the $f_1$ and $\alpha$ values. Our forecast model uses the same parameters used by Makridakis et al. (1993):

$$f_2 = a_1; \quad \alpha_2 = \alpha_3 = \alpha_4 = 0.2; \quad \beta = 0.12; \quad A_1 = M_1 = 0$$
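
A compact sketch of the adaptive-response-rate recursion follows. It assumes the textbook form of ARRSES: the forecast is a weighted average of the last actual and the last forecast, and the weight for the next period is the absolute ratio of the smoothed error to the smoothed absolute error. The guard that keeps the initial weight while the smoothed absolute error is zero, the function name, and the sample series are our assumptions.

```python
def arrses_forecasts(actuals, beta=0.12, alpha_init=0.2, warmup=4):
    """Return [f_2, f_3, ..., f_{n+1}] for adaptive-response-rate SES."""
    forecast = actuals[0]          # f_2 = a_1
    forecasts = [forecast]
    smoothed_error = 0.0           # A_1
    smoothed_abs_error = 0.0       # M_1
    alpha = alpha_init             # alpha_2
    for t in range(2, len(actuals) + 1):          # t is the 1-based period
        actual = actuals[t - 1]
        error = actual - forecast                 # e_t (assumed sign)
        smoothed_error = beta * error + (1 - beta) * smoothed_error
        smoothed_abs_error = beta * abs(error) + (1 - beta) * smoothed_abs_error
        forecast = alpha * actual + (1 - alpha) * forecast   # f_{t+1}
        forecasts.append(forecast)
        # Weight for the next period: fixed during warm-up, adaptive afterwards.
        if t + 1 <= warmup or smoothed_abs_error == 0:
            alpha = alpha_init
        else:
            alpha = abs(smoothed_error / smoothed_abs_error)
    return forecasts


# Hypothetical demand that jumps from 10 to 20 in period 5.
adaptive = arrses_forecasts([10, 10, 10, 10, 20, 20])
print(adaptive)
```

After the jump every error has the same sign, so the ratio reaches one and the next forecast moves all the way to the new level, which is the behavior that distinguishes ARRSES from SES with a fixed weight.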


e. Combination

As mentioned in the literature review, there is an expected gain in applying a combination of forecast methods when all of them individually generate poor results. Hence, this method is just a simple average of the forecast values obtained by the other four methods presented thus far. The corresponding equation is:

$$f_t = \frac{1}{m} \sum_{x=1}^{m} f_{t,x} \qquad (3.8)$$

where:

$x$ = method index

$f_{t,x}$ = forecast generated by the method with index $x$, at time $t$

$m$ = number of methods to be combined
Therefore, our forecast model applies indexes from one to four to the previous methods, resulting in the use of m = 4.

f. Exponential Smoothing with Backcasting

This method is a variation of SES in which the initialization value $f_1$ is obtained by applying the forecasting process in reverse. This particular way to initiate the SES was studied and recommended by Ledolter and Abraham (1984) and is currently used by the Navy's ERP. Hereafter, we will refer to this variation of SES as the NAVY method.

A short description of the NAVY method follows. First, the condition $f_t = a_t$ is applied, meaning that the most recent forecast value equals the most recent actual demand. Then, a fixed $\alpha$ value is applied to obtain backcast values for periods from $t - 1$ back toward $t = 1$, as opposed to a forecast value, which is calculated for period $t + 1$. Our model applies the same smoothing factor used by the Navy's ERP.

The process of generating backcasts continues until the $f_1$ value is obtained. Thereafter, a regular SES forecast method can be initiated.
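
One plausible reading of the backcasting procedure is sketched below. It is not taken from the Navy's ERP, whose exact indexing is not specified here. The sketch assumes that the backward pass starts at the most recent actual demand, smooths toward each earlier observation in turn, and uses the value reached at period 1 as the initial forecast of a regular forward SES pass. The function names and the three-period series are hypothetical.

```python
def backcast_initial_forecast(actuals, alpha=0.2):
    """Run SES backward in time and return the value reached at period 1."""
    backcast = actuals[-1]                        # start at the latest actual
    for actual in reversed(actuals[:-1]):         # periods t-1 down to 1
        backcast = backcast + alpha * (actual - backcast)
    return backcast


def navy_style_forecast(actuals, alpha=0.2):
    """Forward SES initialized with the backcast value (assumed procedure)."""
    forecast = backcast_initial_forecast(actuals, alpha)
    for actual in actuals:
        forecast = forecast + alpha * (actual - forecast)
    return forecast                               # forecast for period n+1


# Hypothetical demand; alpha = 0.5 keeps the arithmetic easy to follow.
history = [10, 20, 30]
initial = backcast_initial_forecast(history, alpha=0.5)
final = navy_style_forecast(history, alpha=0.5)
print(initial, final)
```

Compared with initializing at the first actual demand, the backcast start value already reflects the later observations, so the first few forward forecasts depend less on a single early data point.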

4. Measure Accuracy at the Item Level

After  calculating  all  the  different  forecasts  for  the  test  period,  some  process  has  to 
take  place  to  identify  the  most  accurate  method.  The  following  chart  aims  to  show  the 
different  forecasts  generated  and  how  difficult  it  can  be  to  rank  the  methods  by  accuracy. 
Figure 14. Sample of Forecast Generation

Series shown: Actual Demand, MA, SA, SES, ARRSES, COMB, NAVY.

The vertical axis shows the demand sizes. The colored curves show the different forecasts generated for the same item presented in Figure 13. The chart also shows that the differences in accuracy among the methods are not always visually identifiable.

In order to utilize a quantitative approach for the selection of the best forecast method for a specific item, we applied a pool of four accuracy metrics. All the accuracy metrics used in this analysis were discussed in detail in Chapters II and III.

First, we selected MAE and MSE, respectively Equations (2.4) and (2.2), as they are reported to be commonly used in real situations and can generate valid results when actual demands are zero. Their weakness of producing values that carry units does not harm the quality of the result at the item level. Additionally, we selected CIMIPi* and MASE, respectively Equations (3.3) and (2.23), because the first is currently used by DOD to assess the components' performance, and the second is the alternative metric presented in the comparative analysis of Chapter III.

Table 24 summarizes the results of four forecast accuracy measurements for each one of the six forecast methods applied to a randomly selected sample item from the dataset.
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Table  24.  Summary  of  Accuracy  Results 


NIIN 

14646078 

Demand  Description 
Mean  1837.75 

STD  196.8079146 

CV  0.107091778 


Forecast  Methods 


Simple  average  Moving  average  SES 


MAE 

MSE 

MASE 

CIMIP 


174.59 

36796.09 

0.79 

0.90 


MAE  171.72 

MSE 

MASE 

CIMIP 


36912.69 

0.78 

0.91 


MAE 

MSE 

MASE 

CIMIP 


182.20 

37129.76 

0.83 

0.90 


ARRSES 

MAE  218.27 

MSE  57422.29 

MASE  0.99 

CIMIP  0.88 


Combination  NAVY 

MAE  185.01  MAE  195.23 

MSE  40122.25  MSE  43940.14 

MASE  0.84  MASE  0.89 

CIMIP  0.90  CIMIP  0.89 


Highlighted in yellow are the accuracy metrics’ choices of most accurate forecast
methods.


5. Rank the Forecast Methods by Accuracy Metric

In order to identify the best and worst forecast methods for any particular item, we
generated rankings for each one of the accuracy metrics. For MAE, MSE and MASE,
lower values are better. For CIMIPi*, on the other hand, higher values are better.

The following table uses the results shown in Table 24 to form the
rankings within each one of the accuracy metrics used.


Table 25. Ranking of Forecast Methods by Accuracy Metric

Forecast method                 MAE   MSE   MASE   CIMIP
Simple average                  2     1     2      2
Moving average                  1     2     1      1
Simple Exponential Smoothing    3     3     3      3
ARRSES                          6     6     6      6
Combination                     4     4     4      4
NAVY                            5     5     5      5

For  this  particular  item,  using  MAE  as  the  selected  accuracy  metric,  Moving  Average  is 
the  forecast  method  that  is  expected  to  minimize  the  errors  between  forecast  values  and 
actual  demand. 
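
The ranking step can be sketched as follows, using the rounded values from Table 24. The helper name `rank_methods` is hypothetical. Rounded CIMIP values tie at 0.90 for three methods, so this sketch breaks ties by input order; the original ranks were presumably formed from unrounded values.

```python
def rank_methods(scores, higher_is_better=False):
    """Return {method: rank}, with rank 1 for the most accurate method.

    Ties keep the input order because Python's sort is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {method: position + 1 for position, method in enumerate(ordered)}


# Rounded values copied from Table 24.
mae_scores = {"SA": 174.59, "MA": 171.72, "SES": 182.20,
              "ARRSES": 218.27, "COMB": 185.01, "NAVY": 195.23}
mse_scores = {"SA": 36796.09, "MA": 36912.69, "SES": 37129.76,
              "ARRSES": 57422.29, "COMB": 40122.25, "NAVY": 43940.14}
cimip_scores = {"SA": 0.90, "MA": 0.91, "SES": 0.90,
                "ARRSES": 0.88, "COMB": 0.90, "NAVY": 0.89}

mae_ranks = rank_methods(mae_scores)                             # lower is better
mse_ranks = rank_methods(mse_scores)                             # lower is better
cimip_ranks = rank_methods(cimip_scores, higher_is_better=True)  # higher is better
```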


6. Count of Best Ranks

This analysis investigates the skewness of the distribution of best ranks,
considering the underlying methodological differences of the four accuracy metrics
mentioned. In other words, we test whether a particular forecast method is considered the
most accurate for the majority of items contained in the trimmed data.

7. Generate Overall Accuracy Ranking at the Item Level

We consider the most accurate method to forecast demand for a specific item
to be the one that generates the lowest median of ranks, as shown in Table 26
and Table 27.


Table 26. Overall Ranks

                SA   MA   SES   ARRSES   COMB   NAVY
Overall rank    2    1    3     6        4      5

Table 27. Best and Worst Forecast Methods

Best method     MA
Worst method    ARRSES

Considering  all  four  accuracy  metrics’  results,  Moving  Average  is  considered  the  most 
accurate  forecast  method  for  this  item,  as  it  generates  the  lowest  overall  rank. 
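
A minimal sketch of the median-of-ranks step, using the ranks from Table 25 (listed in the order MAE, MSE, MASE, CIMIP). Variable names are illustrative, and the way ties between equal medians were resolved in the original model is not stated, so this sketch simply keeps input order.

```python
from statistics import median

# Ranks per method from Table 25, in the order MAE, MSE, MASE, CIMIP.
ranks_by_method = {
    "SA": [2, 1, 2, 2],
    "MA": [1, 2, 1, 1],
    "SES": [3, 3, 3, 3],
    "ARRSES": [6, 6, 6, 6],
    "COMB": [4, 4, 4, 4],
    "NAVY": [5, 5, 5, 5],
}

# The lowest median of ranks identifies the most accurate method for the item.
median_rank = {method: median(ranks) for method, ranks in ranks_by_method.items()}
ordered = sorted(median_rank, key=median_rank.get)
overall_rank = {method: position + 1 for position, method in enumerate(ordered)}
best_method, worst_method = ordered[0], ordered[-1]
```

For this item the result reproduces Table 26 and Table 27.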

8. Build Clusters

To investigate whether one forecast method can outperform all the others for a
specific group of items, we created 11 clusters of items, each corresponding to a
specific range of the coefficient of variation (CV). Hendricks and Robey (1936) explain
the coefficient of variation as the ratio of the standard deviation of a number of
measurements to their arithmetic mean. Because this ratio is scale free, it provides a
standard for overall variability assessment and can be used to compare datasets.
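
The CV calculation and cluster assignment can be sketched as below. The text names the clusters 0.0-0.4 and 1.6-2.0, so a width of 0.4 is assumed, with the eleventh cluster collecting every CV of 4.0 or more; that last boundary, the treatment of values that fall exactly on a boundary, and the function names are assumptions.

```python
def coefficient_of_variation(mean, std):
    """Ratio of standard deviation to arithmetic mean (scale free)."""
    if mean == 0:
        raise ValueError("CV is undefined when mean demand is zero")
    return std / mean


def cv_cluster(cv, width=0.4, n_clusters=11):
    """Label the CV cluster; the last cluster is open-ended (assumed)."""
    index = min(int(cv // width), n_clusters - 1)
    low = index * width
    if index == n_clusters - 1:
        return f"{low:.1f}+"
    return f"{low:.1f}-{low + width:.1f}"


# Mean and STD of the sample item reported in Table 24.
sample_cv = coefficient_of_variation(1837.75, 196.8079146)
sample_cluster = cv_cluster(sample_cv)
```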

The following histogram shows the CV clusters, along with the number of items
contained in each.

Figure 15. Histogram of Coefficient of Variation

9. Generate Rankings on Clusters of Coefficient of Variation

In order to select the best forecast method for a specific cluster of coefficient of
variation, we counted the number of items for which each of the forecast methods was
considered the best and the worst option. The sample chart below shows how the rank
results are stored in a given cluster of CV.

Figure 16. The Best and Worst Methods Within a Cluster
[Bar chart. Legend: Best Method, Worst Method. Horizontal axis: SA, MA, SES, ARRSES, COMB, NAVY. One data label, 392, survives from the original chart.]

Results collected in the CV cluster of 0.0-0.4. The vertical axis represents the number of
items, while the horizontal axis shows the forecast methods. In this case, ARRSES is the
method most frequently considered both the best and the worst. That information gives an
idea of the risk involved in selecting a specific forecast method.
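
A minimal sketch of the counting step within one cluster. The item ranks below are hypothetical and only illustrate how a single method can be both the most frequent best and the most frequent worst option.

```python
from collections import Counter


def best_worst_counts(item_ranks):
    """Count how often each method is ranked best and worst.

    item_ranks: list of {method: overall_rank} dicts, one per item in a cluster.
    """
    best, worst = Counter(), Counter()
    for ranks in item_ranks:
        best[min(ranks, key=ranks.get)] += 1
        worst[max(ranks, key=ranks.get)] += 1
    return best, worst


# Hypothetical overall ranks for four items in one CV cluster.
cluster_items = [
    {"SA": 2, "MA": 3, "ARRSES": 1},
    {"SA": 1, "MA": 2, "ARRSES": 3},
    {"SA": 2, "MA": 1, "ARRSES": 3},
    {"SA": 3, "MA": 2, "ARRSES": 1},
]
best, worst = best_worst_counts(cluster_items)
```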

10. Generate MASE Scores of Clusters

As an alternative to using ranks to track the performance of forecast
methods, we calculated the average, minimum and maximum MASE values within each
cluster of coefficient of variation. The intention is to identify a pattern of
performance relative to the naive method as the CV increases. Moreover,
those three values of MASE, measured over time, provide the range of possible
results, which indicates the risk of choosing that specific method for the entire
population.
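
The per-cluster summary can be sketched as a simple grouping. The record layout `(cluster, method, mase)` and the values are hypothetical.

```python
from collections import defaultdict


def mase_summary(records):
    """Average, minimum and maximum MASE per (cluster, method) pair."""
    grouped = defaultdict(list)
    for cluster, method, value in records:
        grouped[(cluster, method)].append(value)
    return {
        key: {"avg": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}
        for key, vals in grouped.items()
    }


# Hypothetical item-level MASE results.
records = [
    ("0.0-0.4", "NAVY", 0.6),
    ("0.0-0.4", "NAVY", 0.8),
    ("1.6-2.0", "NAVY", 0.9),
    ("1.6-2.0", "NAVY", 1.5),
]
summary = mase_summary(records)
```

The gap between `min` and `max` in a cluster is the range that the text treats as an indicator of risk.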

11. Assess the Relative Performance of Navy’s Forecast Method

We used the MASE accuracy metric to measure the potential gain of
implementing different forecasting methods instead of the Navy’s status quo. First, we
counted the percentage of items for which the NAVY method is not the best, meaning that
there is an opportunity to increase accuracy by using another forecast method.

Additionally, we counted the percentage of items for which the Navy’s forecast method
performed worse than the naive method, which means MASE values higher than one. Then,
out of those items, we counted how many times another method was capable of
outperforming the naive method.
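
The three counts described above can be sketched as follows. The function name, the dictionary layout and the MASE values are hypothetical; a tie with the NAVY method is counted here as NAVY being the best, which is an assumption.

```python
def navy_relative_performance(items, navy="NAVY"):
    """Shares of items where NAVY is not best, is worse than naive, and is rescued.

    items: list of {method: mase} dicts, one per item.
    """
    not_best = worse = rescued = 0
    for scores in items:
        best_other = min(v for m, v in scores.items() if m != navy)
        if best_other < scores[navy]:
            not_best += 1            # another method is more accurate
        if scores[navy] > 1.0:
            worse += 1               # NAVY loses to the naive benchmark
            if best_other < 1.0:
                rescued += 1         # another method still beats naive
    n = len(items)
    return {
        "not_best_share": not_best / n,
        "worse_than_naive_share": worse / n,
        "rescued_share": rescued / worse if worse else 0.0,
    }


# Hypothetical MASE values for four items.
items = [
    {"NAVY": 0.8, "SA": 0.9},
    {"NAVY": 1.2, "SA": 0.9},
    {"NAVY": 1.5, "SA": 1.3},
    {"NAVY": 0.7, "SA": 0.6},
]
navy_result = navy_relative_performance(items)
```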


12. Measure the Level of Agreement between MASE and CIMIPi*

In order to complement the comparative analysis conducted in Chapter III, we
measured the number of times that the rank results of MASE and CIMIPi* agree. The idea
is to show, using real data, the magnitude of the theoretical difference between the
metrics.

E. RESULTS

The model described is used to calculate forecast values, along with the respective
accuracy scores, as a means to identify the method that minimizes the expected error for
each item. This section presents results grouped into two categories: accuracy metrics
and forecast methods. The first category utilizes real data to complement the theoretical
comparative analysis of the CIMIP and MASE accuracy metrics conducted in Chapter
III. The second utilizes accuracy measurements as a tool to help forecasters in the task of
optimizing the selection of a forecast method.

1. Accuracy Metrics

As mentioned, there are expected qualitative gains in choosing MASE as a
substitute for the CIMIP metric. As the comparative analysis used small sets of hypothetical
items to demonstrate some characteristics of the metrics, a relevant question remained: do
the results generated by the new metric represent a significant improvement?

In order to answer that question, we have to consider that the current procedures
do not formally involve the use of accuracy measures at the individual item level.
Components are only required to generate aggregated accuracy values to report to DOD as a
representation of the overall forecast performance.

The Navy has tried to implement CIMIPi and LASE, Equations (3.2) and (2.26)
respectively, at the individual level as an internal effort to identify items that represent
significant sources of inaccuracy within the large dataset. The vulnerabilities of those
metrics were exposed in Chapter II, while Chapter III concluded that both MASE and
CIMIPi*, Equations (2.23) and (3.3) respectively, can be used at the individual level.

However, the model presented in this chapter makes more ambitious use of
accuracy metrics at the individual level. Assuming that multiple forecasts are generated
per item in a given period, accuracy values can be used as inputs to support the decision of
selecting the best forecast method.

Figure 17 shows the agreement level between MASE and CIMIPi*, both with each
other and with the overall rank generated. The agreement level is the percentage of
items for which the results of two accuracy metrics lead to the same conclusion. This
analysis uses ranks as the criterion to set a common ground for comparison among the
accuracy metrics.
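
As an illustration of the agreement level, the sketch below computes two shares: items where two metrics produce identical complete rankings, and items where they agree only on the best method. The rank data and names are hypothetical.

```python
def agreement_levels(rank_pairs):
    """Return (full_ranking_agreement, best_method_agreement) as shares of items.

    rank_pairs: list of (ranks_a, ranks_b), each a {method: rank} dict.
    """
    n = len(rank_pairs)
    full = sum(1 for a, b in rank_pairs if a == b)
    best = sum(1 for a, b in rank_pairs
               if min(a, key=a.get) == min(b, key=b.get))
    return full / n, best / n


# Hypothetical (MASE ranks, CIMIP ranks) for four items.
rank_pairs = [
    ({"SA": 2, "MA": 1, "SES": 3}, {"SA": 2, "MA": 1, "SES": 3}),  # identical
    ({"SA": 2, "MA": 1, "SES": 3}, {"SA": 3, "MA": 1, "SES": 2}),  # same best only
    ({"SA": 1, "MA": 2, "SES": 3}, {"SA": 2, "MA": 1, "SES": 3}),  # different best
    ({"SA": 3, "MA": 2, "SES": 1}, {"SA": 3, "MA": 2, "SES": 1}),  # identical
]
full_agreement, best_agreement = agreement_levels(rank_pairs)
```

Agreement on the best method can never be lower than agreement on the complete ranking, since identical rankings imply the same best method.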

Figure 17. MASE and CIMIPi* Agreement
[Bar chart. Legend: Agree, Disagree. Bars: "All MASE and CIMIP ranks"; "MASE and CIMIP on the best method".]

The first bar on the left represents the percentage of items for which MASE and CIMIPi*
results led to exactly the same ranks for all six forecast methods used in the model. The
second bar measures the agreement level on selecting the most accurate forecast method.

When forming complete rankings of forecast methods, the methodological
difference between MASE and CIMIPi* led to a significant divergence of 25%.
However, the main objective of the whole model is to provide useful information to
optimize the selection of the most accurate forecast method for each item. On that question,
there is a high agreement level of 94% between the accuracy metrics.

2.  Forecast  Methods  (Time  Series) 

We  acknowledge  the  fact  that  parameters  used  to  generate  accurate  forecasts  in 
the  past  do  not  guarantee  high  performance  in  the  future.  However,  based  on  the 
assumption  of  demand  stationarity,  we  expect  that  the  selection  of  the  most  accurate 
method  in  past  data  can  result  in  improvements  on  future  forecast  performance. 

This  section  aims  to  uncover  the  existence  of  patterns  that  could  be  used  to  form  a 
decision  rule  on  the  selection  of  the  best  forecast  method.  The  tests  were  conducted  under 
two  main  methods:  analysis  of  ranks  and  MASE  results  analysis. 

a.  Analysis  of  Ranks 

(1)  Whole  Population  of  Items 

Considering the completely trimmed data, we first counted the number of items for
which each forecast method was considered the most accurate by each accuracy metric.
Results follow:

Figure 18. Count of Best Ranks by Accuracy Metric

The vertical axis represents the number of items.

There  is  no  clear  evidence,  in  the  trimmed  data,  that  one  forecast  method  is 
mostly  considered  the  best  option.  While  MSE  results  are  the  most  skewed  toward  SA,  the 
other  three  accuracy  metrics  are  slightly  skewed  toward  ARRSES. 

In  order  to  enable  a  clear  visualization  of  the  overall  skewness  of  ranks,  among 
the  forecast  methods,  we  consolidated  the  counts  of  the  four  accuracy  metrics.  Results  are 
shown  in  Figure  19. 


Figure 19. Consolidated Percentages of Best Ranks


We found no clear evidence, in the trimmed data used, that a particular forecast
method is capable of outperforming the others for a large majority of
items. Hence, further analyses are needed to help in the decision of selecting the most
accurate forecast method.

(2) Clusters of Coefficient of Variation

Rather than trying to identify the most accurate forecast method for the whole
population of items, the next analysis investigates the benefits of choosing a specific
forecast method for groups of items that have similar demand behavior, in terms of
amount of variability. Hence, the following analysis applies a rank analysis, utilizing
clusters of CV to group items. Results follow.

Figure 20. Best and Worst Forecast Methods by Cluster

The relation between the blue and the red bars provides an idea of the risk involved in the
choice of one fixed method to be used in a whole cluster of items.


There is a pattern, across the clusters, of high risk in selecting one forecast method
to be applied to the whole group of items. For example, with all clusters combined, ARRSES
was the method most often selected as best. At the same time, it was considered the worst
option more times than any other method. Hence, we can state that there is a significant
risk of inaccuracy in choosing one method to be used in a cluster of CV.

Another relevant investigation concerns the potential existence of upward or
downward trends in forecast method ranks as CV increases. Figure 21 shows how
each forecast method is ranked in the clusters of CV, based only on the number of times
it was considered the best option.


Figure 21. Average Rank Variation by Clusters


The vertical axes represent the aggregated rank, which is related to the number of times
one method was considered the best option within each cluster. Trend lines are in black.


The Combination method shows a constant worst rank in all clusters of CV. That
does not mean that it is the absolute worst method. What it does mean is that it is not
often the best method, setting aside the insignificant number of items for which it was
considered the worst method. Additionally, trend lines help to explain a significant
amount of variance in the results of two methods. ARRSES tends to lose rank as CV
increases, though not uniformly, while the NAVY method tends to gain rank, also not
uniformly, as variability increases.

The drawback of the analysis of ranks is that it does not provide an accurate sense
of differentiation between methods. As shown in Figure 20, differences in counts of best
rank among the methods are sometimes significant and sometimes clearly irrelevant.
Therefore, analysis of ranks may distort the existing accuracy difference between the
methods.

b. Analysis of MASE Results

In this section we analyze MASE results collected in the test period to select the
forecast method to be used thereafter. The first analysis investigates whether
forecast methods behave differently as the coefficient of variation increases, in order to
indicate the use of one method for items with less variable demands and another for items
with more variable demands.

Figure 22 shows how the minimum, maximum and average MASE values of each
forecast method change as CV increases.

Figure 22. MASE Values per Forecast Method
[Panel charts, one per forecast method. Surviving panel titles: Simple Average, Moving Average.]

All the charts utilize exponential vertical axes to capture the entire spectrum of possible
results. The horizontal axes correspond to the clusters of coefficient of variation.
Maximum and minimum values provide an idea of the risk involved in the selection of the
method as a fixed solution.


All forecast methods considered in the model generate similar shapes of
maximum and average curves. However, the ARRSES and NAVY methods are capable of
generating the lowest minimum values, thus widening the range of possible values by
allowing significantly accurate forecasts at high values of CV.

The similarity of the accuracy curves’ shapes shows that there is no evidence that
selecting a forecast method according to low or high variability will result in any
accuracy improvement. That similarity is partially explained by the fact that the forecast
methods used in the model are all classified as quantitative time series methods. Hence,
they are all based on the same assumption of demand stationarity, as they use historical
data to predict future values. Furthermore, time series forecast methods can be considered
responsive or smooth, depending on the parameters used. SA is a smooth method by
nature, while the k, α and β values used respectively in MA, SES and ARRSES made
them behave as smooth methods as well. The Combination method can also be considered
smooth, as it averages the forecasts of the previous four methods. The NAVY method is the
most responsive in the model, as it uses α = 0.3.
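
To illustrate the difference between a responsive and a smooth parameter choice, the sketch below applies the standard simple exponential smoothing recursion, next forecast = α × demand + (1 − α) × current forecast, to a hypothetical demand series with a step change. The value α = 0.3 comes from the text; α = 0.1 is only an illustrative smooth setting, initializing with the first observation is an assumption, and the Navy's backcasting step is not modeled.

```python
def ses_forecasts(demand, alpha):
    """One-step-ahead simple exponential smoothing forecasts.

    Returns len(demand) + 1 values; the last one is the forecast for the
    period after the series ends. Initial level = first observation (assumed).
    """
    level = demand[0]
    forecasts = []
    for observed in demand:
        forecasts.append(level)                       # forecast made before seeing demand
        level = alpha * observed + (1 - alpha) * level
    forecasts.append(level)
    return forecasts


# Hypothetical series: demand steps from 100 to 200 halfway through.
demand = [100] * 5 + [200] * 5
responsive = ses_forecasts(demand, alpha=0.3)[-1]     # about 183.2
smooth = ses_forecasts(demand, alpha=0.1)[-1]         # about 141.0
```

After five periods at the new level, the α = 0.3 forecast has closed most of the gap, while the α = 0.1 forecast is still well below 200.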

Figure 23 shows the six average MASE curves together, corresponding to the
forecast methods applied in the model, to show the similarity in terms of forecast
accuracy values.

Figure 23. Average MASE Results by Forecast Method

The vertical axis shows MASE values and was intentionally cut at 2.0, as the
values continue to increase and values higher than 1.0 are considered worse than the naive
method. For low CV values, accuracy results are similar, but they tend to diverge as CV
increases.

In order to optimize the quality of forecast results, we can apply an average
MASE value of 1.0 as a threshold: the use of one specific forecast method is
recommendable when its average MASE is below 1.0, because it is then capable of
outperforming the naive method systematically. For items with higher values of CV,
deeper attention is needed to support the forecasting process.

Applying that threshold, we found that none of the forecast methods used in the
model systematically outperforms the naive method for CV values higher than
1.6, while all of them can outperform the naive method, on average, for CV values lower
than 1.6. Hereafter, we will refer to the range 0 < CV < 1.6 as the “selected data”.

Figure 24 shows the same results as Figure 23, but on a different scale, as its
MASE values are limited to 1.0.

Figure 24. Average MASE Results in the Selected Data

Within the range of CV in which the forecast methods can be used to outperform the naive
benchmark, the NAVY method is systematically considered the best option.

Although the NAVY method had better performance in all clusters of CV in the
selected data, we identified a risk in using a fixed forecast method for a group of items.
Hence, we investigated the potential benefit to accuracy when the most accurate forecast
method is selected for each item, which we call the Flexible Method, instead of working
with a fixed method.

Figure 25 shows that the adoption of the Flexible Method in the selected data
resulted in a significant gain in accuracy when compared with each one of the forecast
methods applied individually.

Figure 25. Accuracy Gain of Flexible Method

The bars represent the average of MASE results for items with CV < 1.6.
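
The comparison behind the Flexible Method can be sketched as follows: a fixed method averages one method's MASE over all items, while the Flexible Method averages each item's lowest MASE. The function names and MASE values are hypothetical.

```python
def average_mase_fixed(items, method):
    """Average MASE when the same method is used for every item."""
    return sum(scores[method] for scores in items) / len(items)


def average_mase_flexible(items):
    """Average MASE when each item uses its own most accurate method."""
    return sum(min(scores.values()) for scores in items) / len(items)


# Hypothetical MASE results for two items.
items = [
    {"NAVY": 0.8, "SA": 0.9},
    {"NAVY": 1.2, "SA": 0.7},
]
fixed_navy = average_mase_fixed(items, "NAVY")
fixed_sa = average_mase_fixed(items, "SA")
flexible = average_mase_flexible(items)
```

When the selection is made on the same period that is scored, the flexible average can never exceed any fixed average, so the size of the gain on later periods would need to be confirmed on data not used for the selection.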

Additionally, Figure 26 shows that the Flexible Method not only is more accurate
than the NAVY method, which was considered the most accurate among the six
methods applied in the selected data, but also extends the range of CV in
which it can be used to systematically outperform the naive benchmark.

Figure 26. MASE Values of NAVY and Flexible Method

The Flexible Method resulted in significantly superior accuracy in all clusters of CV in the
selected data. Additionally, it generates an average MASE < 1 for the CV cluster (1.6-2.0),
which extends the overall range of CV values in which the use of time series forecast
methods is expected to outperform the naive method.


Therefore, considering the data used, the adoption of the Flexible Method represented
a significant gain in forecast accuracy as well as an extension in the number of items for
which time series forecasts were considered recommendable. Applying our findings, all six
time series forecast methods, if applied as a fixed solution, were recommendable for
17,437 items, which represents 57.22% of the trimmed data. Meanwhile, using the same
criteria, the Flexible Method is considered recommendable for 22,256 items,
representing 73.04% of the trimmed data.
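
The arithmetic behind these two coverage figures can be checked with the short sketch below. It uses only the four numbers quoted above; the implied size of the trimmed data is derived, not reported, and the two estimates differ slightly because the percentages are rounded.

```python
fixed_items, fixed_share = 17_437, 0.5722
flexible_items, flexible_share = 22_256, 0.7304

# Both figures should imply roughly the same number of items in the trimmed data.
implied_total_fixed = fixed_items / fixed_share           # about 30,474
implied_total_flexible = flexible_items / flexible_share  # about 30,471

extra_items = flexible_items - fixed_items                # additional items covered
gain_points = (flexible_share - fixed_share) * 100        # percentage points
relative_gain = extra_items / fixed_items                 # growth in covered items
```

On these figures the Flexible Method covers 4,819 more items, a gain of 15.82 percentage points of the trimmed data.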

F.  CHAPTER  SUMMARY 

After applying a model that calculates demand forecasts and accuracy values for
all items’ data, the most significant findings were:

• Despite the methodological differences and theoretical superiority of MASE
over CIMIPi*, both generated a very high level of agreement when
selecting the most accurate forecast method;

• The calculation of forecast accuracy can be used by the forecasters as a
managerial tool, instead of just fulfilling the need for reporting;

• In order to provide information that helps to improve the forecasting
processes, accuracy has to be calculated at the item level;

• All forecast methods applied in the model tend to become less accurate
relative to the naive method as CV increases;

• Using averages of MASE values, the NAVY method was considered the most
accurate of all six forecast methods used in the model for all clusters of
CV < 1.6;

• The use of the Flexible Method resulted in a significant gain in accuracy
when compared to any of the other forecasting methods applied
individually.

V. FINDINGS, RECOMMENDATIONS AND FUTURE RESEARCH


A.  FINDINGS 

While  many  of  our  findings  throughout  this  research  are  detailed  at  the  point  of 
discussion,  the  major  findings  of  our  research  in  regard  to  CIMIPf  and  forecast  accuracy 
measurement  are  summarized  below  for  ease  of  access. 

1. CIMIPf Weaknesses

CIMIPf  is  not  able  to  produce  accuracy  results  for  individual  line  items  when  the 
actual  demand  for  that  item  during  the  period,  usually  one  year,  is  zero.  This  complicates 
the  individual  line  item  assessment  of  forecast  accuracy,  since  CIMIPf  returns  an  invalid 
division-by-zero  result.  This  weakness  does  not  prevent  the  aggregation  of  results  for 
multiple  line  items  because  of  the  summation  that  occurs  in  the  denominator  prior  to  the 
final  calculation. 

CIMIPf results are significantly affected by the unit costs that are included in both
the  numerator  and  denominator  of  the  equation.  The  inclusion  of  unit  cost  as  an 
independent  variable  in  CIMIPf  detracts  from  the  primary  purpose  of  measuring  forecast 
accuracy  performance. 

CIMIPf produces aggregated results that are not inherently intuitive and are
disproportionately affected by over-estimations. This is especially evident with low
demand items, where the error is more likely to exceed actual demand. We found that the
aggregate CIMIPf for 28,235 low demand items produced a large negative result (-314%),
while the aggregate CIMIPf for 15,690 high demand items produced a modest positive
result (58%). As another example of the effect of unit cost, the high dollar weighting of
the high demand group meant that the total CIMIPf result was 48%.

CIMIPf does not consider the difficulty of accurately forecasting the entirety of
material that the services and DLA are charged with managing. Its lack of a
benchmarking function, similar to the one found in MASE, results in CIMIPf directly
comparing the forecasting performance of the services and DLA against each other.
Although we did not compare the performance of the Navy versus DLA, without
consideration of a performance benchmark, the services could be penalized for what is
considered to be poor performance or incentivized to make risky decisions in an effort to
improve forecasting performance.

2.  Forecast  Accuracy 

There has been significant study of forecast accuracy within the
academic world. Among the large number of forecast accuracy metrics currently available
in the literature, MASE was considered useful and theoretically superior to all variants of
CIMIP.

From the perspective of IMs at WSS, the measurement of accuracy at the item
level generates more value than one aggregated accuracy number, as currently required
by DOD.

Item  accuracy  measurements  enable  a  better  identification  of  poorly  forecasted 
items  and  can  also  be  applied  as  a  managerial  tool  for  determining  which  forecast  method 
to  utilize. 


3.  Demand  Forecasting 

The task of demand forecasting within the DOD is very complex because demand
patterns are significantly heterogeneous. Using MASE as the forecast accuracy
measurement, we found that the Navy’s preferred forecasting method, on average,
outperformed the other five methods relative to the naive method when CV was
less than 1.6. Additionally, flexibility in the choice of forecasting method at the
individual item level enabled our test data to outperform the naive method when CV was
less than 2.0.

B. RECOMMENDATIONS

1. DOD

The following are recommendations for the DOD to improve demand forecasting:

a. Replace CIMIPf with MASE as the Aggregate Forecast Accuracy
Measurement of Record

As we have shown, MASE is superior to CIMIPf in its ability to provide intuitive
results across more demand patterns, while also avoiding distortions from unit cost and
demand volume. The built-in benchmarking of the MASE equation will also enable the
DOD to more accurately assess the forecasting performance of the services and DLA.

b.  Consider  the  Naive  Method  as  a  Basis  for  Department  Benchmarks 

Direct  comparison  of  demand  forecasting  performance  between  the  services  and 
DLA  using  an  absolute  error  metric,  such  as  CIMIPf,  does  not  consider  the  difficulty  of 
forecasting  for  the  unique  materiel  populations.  A  department-wide  goal  that  arbitrarily 
declares  a  certain  accuracy  percentage  as  acceptable  does  not  accurately  reflect  the 
complexity  of  the  task  and  has  the  potential  to  drive  counter-productive  behavior  in  an 
effort  to  reach  the  goal.  A  better  measure  of  demand  forecasting  performance  would 
utilize  a  benchmarked  metric,  such  as  MASE,  and  then  set  the  standard  as  outperforming 
the  benchmark.  In  the  case  of  MASE,  which  uses  the  naive  method  as  a  benchmark,  this 
would  encourage  the  services  and  DLA  to  attain  an  aggregate  forecast  accuracy  score 
equal  to  or  less  than  some  number  less  than  one. 

2. Navy

The  following  are  recommendations  for  the  Navy  to  improve  demand  forecasting: 

a.  Transition  to  Flexible  Forecasting  Methods  at  the  Item  Level 

As  we  have  shown,  the  Navy’s  current  forecasting  method  of  exponential 
smoothing  with  backcasting  outperforms  the  naive  method  on  average  when  the  CV  is 
less  than  1.6.  If  NAVSUP’s  forecasters  had  flexibility  in  their  choice  of  forecasting 
method,  then  on  average,  they  would  be  able  to  select  an  analytical  forecasting  method 
that  outperformed  the  naive  method  when  the  CV  of  an  item  is  less  than  2.0.  The 
complexity  of  generating  accurate  demand  forecasts  for  such  a  diverse  set  of  items  does 
not  lend  itself  to  using  only  one  analytical  forecasting  method.  As  the  ERP  program 
improves its capabilities, the Navy would benefit from more flexibility in its forecasting
methods. The ideal approach would be to apply multiple forecast methods to the
historical data of each line item and then choose the forecast method that optimizes the
MASE result, or whichever accuracy metric the Navy utilizes.
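
The approach described above can be sketched end to end for a single line item. The sketch assumes one-step-ahead forecasts, MASE scaled by the naive error over the periods before the evaluation window, and three simple candidate methods; the series, the window `k=4` and all function names are hypothetical, and this is not a description of how ERP generates forecasts.

```python
from statistics import mean


def simple_average(history):
    return mean(history)


def moving_average(history, k=4):
    return mean(history[-k:])


def naive(history):
    return history[-1]


def mase_for(method, series, start):
    """MASE of one-step-ahead forecasts for periods start..end of the series."""
    forecasts = [method(series[:t]) for t in range(start, len(series))]
    errors = [abs(a - f) for a, f in zip(series[start:], forecasts)]
    naive_errors = [abs(series[t] - series[t - 1]) for t in range(1, start)]
    return mean(errors) / mean(naive_errors)


def select_method(series, methods, start):
    """Return the method name with the lowest MASE, plus all scores."""
    scores = {name: mase_for(fn, series, start) for name, fn in methods.items()}
    return min(scores, key=scores.get), scores


# Hypothetical quarterly demand for one line item.
series = [10, 14, 10, 14, 10, 14, 10, 14]
methods = {"SA": simple_average, "MA": moving_average, "NAIVE": naive}
best, scores = select_method(series, methods, start=4)
```

For this alternating series the moving average is selected, and the naive method scores exactly 1.0 by construction.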

b.  Utilize  MASE  to  Analyze  Forecast  Accuracy  at  the  Item  Level 

MASE has advantages over both the CIMIPf and LASE equations, and utilizing it
as a forecast accuracy measurement will enable WSS to better identify specific line items
that have not been well forecasted over time, even when actual demand is zero.

c. Publish a NAVSUP Demand Forecasting Procedures Instruction

During  the  course  of  our  research  we  could  not  locate  a  NAVSUP  instruction  that 
detailed  the  procedures  that  WSS  shall  use  to  generate  demand  forecasts  for  all  of  the 
various  situations  and  how  to  measure  those  results.  While  there  are  internal  business 
rules  and  other  technical  ERP  documents,  an  instruction  of  this  type  would  ensure  a 
broader  understanding  of  demand  forecasting  across  the  Navy  and  open  up  the  process 
for  constructive  criticism  that  could  lead  to  improved  results. 

C.  AREAS  FOR  FUTURE  RESEARCH 

The  challenge  of  accurately  forecasting  demand  across  the  DOD  is  not  a  simple 
matter  and  the  recommendations  we  have  offered  here  are  not  likely  to  solve  all  of  the 
issues  that  prevent  the  DOD  from  improving  forecast  performance.  During  the  course  of 
our  research  we  looked  at  many  segments  of  this  issue  that  we  did  not  have  the 
opportunity  to  explore  further.  Some  of  these  ideas  may  generate  constructive 
improvements while others may not. The following ideas, which are not mutually exclusive,
deserve further study in order to improve demand forecasting within the Navy
and DOD.

1.  Item  Manager  Discretion  to  Adjust  ERP  Derived  Forecast 

In our discussions with NAVSUP we learned that after ERP develops demand
forecasts using the exponential smoothing with backcasting method, these forecasts are
subject to IM review and possible adjustment. We feel that it would be worthwhile to
compare the effectiveness of the IM adjusted forecasts to the original ERP developed
forecast. A comparison of the actual demand data to the original and adjusted forecasts
should reveal if the IM adjusted forecasts result in more or less accurate forecasts than
the original ERP derived forecast. The scope of this research could examine all LCI's or
just a specific LCI-subset, since NAVSUP uses different forecasting methods to generate
forecasts for each LCI group.
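
A comparison of this kind could be tabulated along the following lines. This is a minimal sketch with made-up numbers: the record layout `(actual, erp_forecast, im_forecast)` and the function name are assumptions, and absolute error is used only as a simple stand-in for whichever accuracy metric the study adopts.

```python
def compare_adjustments(records):
    """Summarize whether IM adjustments moved forecasts closer to actual demand.

    Each record is (actual, erp_forecast, im_forecast) for one item and period.
    """
    summary = {"improved": 0, "worsened": 0, "tied": 0}
    erp_total = 0.0
    im_total = 0.0
    for actual, erp_forecast, im_forecast in records:
        erp_error = abs(actual - erp_forecast)
        im_error = abs(actual - im_forecast)
        erp_total += erp_error
        im_total += im_error
        if im_error < erp_error:
            summary["improved"] += 1
        elif im_error > erp_error:
            summary["worsened"] += 1
        else:
            summary["tied"] += 1
    # Mean absolute error of each forecast source across all records.
    summary["erp_mae"] = erp_total / len(records)
    summary["im_mae"] = im_total / len(records)
    return summary


# Hypothetical data: (actual demand, ERP forecast, IM adjusted forecast).
records = [(10, 14, 11), (0, 3, 5), (8, 8, 8), (20, 15, 18)]
summary = compare_adjustments(records)
```

Running the same tabulation separately for each LCI group would keep the different underlying forecasting methods from being mixed in one result.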

Additionally, surveys of the IM's could determine the leading reasons for
adjusting an ERP-derived forecast. A comparison of these IM provided reasons with the
actual forecast performance could help determine which reasons generally result in more
accurate forecasts and which generally result in less accurate forecasts. If the human
survey portion is included, the NPS researcher would need to obtain permission from the
human research protection program office and the institutional review board. A study of
this kind would also require the full support of NAVSUP and access to the IM's.

2.  Explore  the  Use  of  Retail  Level  Demand  in  Forecast  Development 

To develop demand forecasts, NAVSUP uses quarterly wholesale level demand
over a five-year period. While this data provides a good proxy for aggregated retail
demand and is easier to obtain, it also results in less frequent demand occurrences and
could hide demand patterns. Although retail level demand can be challenging to organize
and interpret, it may provide a better data set to generate demand forecasts. In
multi-echelon supply chains, demand information from the end user level must be tracked in
order to mitigate the negative impacts of the bullwhip effect. When demand variability at
the retail level is combined with a lack of communication up the supply chain, excess
inventory is likely to form at all levels. CIMIP has addressed inventory visibility
challenges, but sharing of end-customer demand information can also help to reduce
unnecessary inventory. We propose an analysis of whether properly trimmed retail level
demand can provide a better demand forecast for items that have traditionally been
difficult to forecast with only wholesale level demand.

3. Explore Alternatives to Managing Material by Life Cycle Indicator

The  Navy  currently  uses  LCI’s  from  one  to  six  to  segregate  material  based  on  the 
maturity of the parent program that it supports. Initially for LCI-1, when demand is
nonexistent, engineering estimates are used to develop forecasts. As the item progresses to
the  next  LCI  categories  these  engineering  estimates  begin  to  factor  in  observed  demand 
in  order  to  develop  forecasts.  By  the  time  an  item  is  classified  as  an  LCI-4  or  -5  the 
analytical  forecast  is  based  solely  on  observed  demand.  While  in  general  this  makes 
sense,  it  may  be  possible  that  items  could  be  more  effectively  managed  and  forecasted  if 
they  were  placed  into  groups  based  on  other  criteria,  instead  of  their  parent  programs’  life 
cycle.  We  propose  a  study  to  determine  what  these  more  effective  sorting  criteria  are  and 
how  best  to  employ  them. 

4. Time Periods and Fractions

The  Navy  currently  uses  five  years  of  wholesale  level  demand,  sorted  into  20 
quarterly buckets, to generate a single number demand forecast for the 21st quarter. To
obtain a 12-month forecast, the quarterly forecast number is multiplied by four. This
single number is not always an integer. We propose a study of the effect of using
different  time  buckets  (days,  weeks,  months,  etc.),  different  historical  time  periods  (1,  3, 
7,  etc.  years)  and  the  treatment  of  fractional  demand  forecasts  (round  up,  round  down,  no 
rounding,  etc.)  to  potentially  generate  more  accurate  forecasts. 
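
The three levers named above (bucket size, annualization, and rounding treatment) can be illustrated with a short sketch. It assumes demand history is available as a list of monthly quantities; the function names and the rounding rule labels are hypothetical, and no forecasting method is implied.

```python
import math


def to_buckets(monthly_demand, months_per_bucket):
    """Sum monthly demand into equal buckets (3 months per bucket gives quarters)."""
    if len(monthly_demand) % months_per_bucket != 0:
        raise ValueError("history length must be a multiple of the bucket size")
    return [
        sum(monthly_demand[start:start + months_per_bucket])
        for start in range(0, len(monthly_demand), months_per_bucket)
    ]


def annualize(quarterly_forecast):
    """Scale a single-quarter forecast to 12 months by multiplying by four."""
    return quarterly_forecast * 4


# Candidate treatments for a fractional forecast (labels are assumptions).
ROUNDING_RULES = {
    "up": math.ceil,
    "down": math.floor,
    "none": lambda value: value,
}


def apply_rounding(forecast, rule):
    """Apply one of the rounding treatments to a forecast quantity."""
    return ROUNDING_RULES[rule](forecast)


# Five years of made-up monthly demand become 20 quarterly buckets.
quarters = to_buckets([1] * 60, months_per_bucket=3)
annual = annualize(2.3)  # a fractional quarterly forecast, scaled to 12 months
options = {rule: apply_rounding(annual, rule) for rule in ROUNDING_RULES}
```

For low-demand items the choice of rounding rule can change the annual quantity by a large proportion, which is one reason the treatment of fractions may merit study alongside bucket size.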

5. Investigate the Use of Alternative Forecasting Methods

The mathematical model presented in Chapter IV aims to generate improvement
in forecast accuracy. However, it is not sufficient to select methods with the best MASE
values throughout the entire curve, disregarding the fact that they can be worse than the
naive method. The naive method is considered a rudimentary prediction tool, yet it still
systematically outperforms the simple forecast methods used in this research for items
with CV > 2.0. While we cannot recommend its blanket utilization for those items, we
propose an investigation of the potential benefits of using either more complex
time-series forecasting methods or alternative forecasting methods such as causal, qualitative,
and expert estimates.

6. Analyze the DOD Bias Metric

The  initial  concerns  of  Congress  and  GAO,  in  dealing  with  the  issue  of  excessive 
secondary  inventory,  seemed  to  be  more  focused  on  reducing  the  bias  to  over-forecast 
instead  of  improving  forecast  accuracy.  While  the  focus  today  seems  to  have  shifted 
away  from  bias  toward  accuracy,  there  is  still  a  requirement  to  measure  bias  in 
forecasting.  The  DOD  business  rules  that  defined  the  accuracy  metric  also  laid  out  the 
procedures  for  utilizing  the  bias  metric.  As  we  have  discussed,  our  research  centered  on 
the  accuracy  metric,  but  the  bias  metric,  as  defined  in  Equation  (2.25),  could  also  benefit 
from  a  further  analysis  of  its  strengths  and  weaknesses. 

7. Portfolio Theory Approach

Portfolio  theory  indicates  that  an  investor  can  optimize  the  trade-off  between  risk 
and  reward  through  diversification.  If  we  apply  that  rationale  to  the  flexible  forecasting 
model,  better  results  are  possible  when  the  pool  of  forecasting  methods  reflects  a  large 
spectrum  of  responsiveness,  and  is  comprised  of  specific  methods  to  deal  with  trends, 
seasonality  and  intermittent  demand.  We  propose  an  investigation  of  the  benefits  of 
applying  a  portfolio  theory  rationale  to  the  flexible  forecasting  model. 

8. Grouping Method

In  our  research,  we  grouped  items  into  CV  clusters  as  an  attempt  to  identify 
methods  that  are  expected  to  outperform  others  for  a  particular  range  of  variability. 
However,  that  grouping  method  was  not  able  to  segregate  items  in  a  way  that  one  specific 
forecasting  method  outperformed  the  others.  We  acknowledge  the  possibility  of  grouping 
items  in  different  ways,  like  demand  patterns,  clusters  of  unit  costs,  clusters  of  dollar 
demand,  etc.  However,  forecasting  method  selection  at  the  item  level  is  more  likely  to 
produce  more  accurate  forecasts  than  any  other  kind  of  grouping.  Individualized  forecasts 
are  likely  to  require  significantly  more  effort,  so  we  propose  an  analysis  to  determine  if 
this additional effort at the item level pays off in terms of marginal gains in accuracy.
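
Grouping by CV can be sketched as below. The cluster boundaries other than 2.0 are hypothetical, as is the choice of the population standard deviation; the clusters used in our analysis may have been defined differently.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Hypothetical cluster boundaries; only the 2.0 threshold appears in this report.
CV_EDGES = [0.5, 1.0, 2.0]
CV_LABELS = ["CV < 0.5", "0.5 <= CV < 1.0", "1.0 <= CV < 2.0", "CV >= 2.0"]


def coefficient_of_variation(demand):
    """Population standard deviation divided by mean; None when mean demand is zero."""
    average = mean(demand)
    if average == 0:
        return None
    return pstdev(demand) / average


def cv_cluster(demand):
    """Return the label of the CV cluster that a demand history falls into."""
    cv = coefficient_of_variation(demand)
    if cv is None:
        return "no demand"
    return CV_LABELS[bisect_right(CV_EDGES, cv)]


# Made-up quarterly histories for three items.
steady = cv_cluster([4, 4, 4, 4])
lumpy = cv_cluster([0, 0, 0, 8])
sparse = cv_cluster([0] * 19 + [20])
```

The same pattern extends to other grouping criteria, such as unit cost or dollar demand, by replacing the statistic and the boundaries.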

9. Optimization of Parameters

Parameters used to initiate the calculations of forecast values in each of the
methods that we tested were arbitrarily chosen. The intent of our research was to uncover
potential opportunities for improvement by applying a flexible forecasting model. We
propose further investigation of the results generated if the parameters were optimized for
each item.
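
As one possible form of per-item optimization, the smoothing constant of simple exponential smoothing could be chosen by a grid search over in-sample one-step-ahead errors. The sketch below assumes that approach; the grid of values, the use of the first observation as the initial level, and the function names are all illustrative choices, not the settings used in our tests.

```python
def one_step_errors(history, alpha):
    """Absolute one-step-ahead errors of simple exponential smoothing."""
    level = history[0]  # assumption: initialize the level with the first observation
    errors = []
    for actual in history[1:]:
        errors.append(abs(actual - level))  # forecast for this period is the prior level
        level = alpha * actual + (1 - alpha) * level
    return errors


def best_alpha(history, grid=None):
    """Return the alpha from `grid` with the lowest mean absolute one-step error."""
    if grid is None:
        grid = [i / 10 for i in range(1, 10)]  # 0.1 through 0.9

    def mean_error(alpha):
        errors = one_step_errors(history, alpha)
        return sum(errors) / len(errors)

    return min(grid, key=mean_error)


# Made-up histories: a steady trend favors a responsive (high) alpha,
# while an alternating pattern favors a heavily smoothed (low) alpha.
trend_alpha = best_alpha([1, 2, 3, 4, 5, 6, 7, 8])
noisy_alpha = best_alpha([10, 0, 10, 0, 10, 0, 10, 0])
```

Because the parameter is fitted to the same history it is scored on, a study of this kind would likely also need a holdout period to confirm that the optimized values improve out-of-sample accuracy.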


10. Apply Statistical Tools to Generalize Results

During  our  analysis  of  the  DOD’s  accuracy  metric,  we  utilized  quick, 
hypothetical  tests  to  uncover  evidence  of  inherent  flaws  within  CIMIPf.  The  simplicity  of 
these  tests  unfortunately  means  that  the  findings  are  not  supported  by  any  statistical 
analysis  and  cannot  be  generalized  to  larger  datasets.  Therefore,  we  propose  statistical 
analyses  on  the  impacts  of  the  CIMIPf  flaws  that  we  identified. 